Memory Allocators and Ceph

* Memory Allocators and Ceph
@ 2015-05-27 17:40 Robert LeBlanc
       [not found] ` <CAANLjFpErC4xbwgJgZGWFdMaWQ1Q4otBksyRqP0jfWKnqVacog-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Robert LeBlanc @ 2015-05-27 17:40 UTC (permalink / raw)
  To: ceph-users@lists.ceph.com, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 3276 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

With all the talk of tcmalloc and jemalloc, I decided to do some
testing of the different memory allocating technologies between KVM
and Ceph. These tests were done on a pre-production system, so I've tried
to remove some of the variance with many runs and averages. The details
are as follows:

Ceph v0.94.1 (I backported a branch from master to get full jemalloc
support for part of the tests)
tcmalloc v2.4-3
jemalloc v3.6.0-1
QEMU v0.12.1.2-2 (as I understand it, the latest version for RH6/CentOS6)
OSDs are only spindles with SSD journals, no SSD tiering

The 11 Ceph nodes are:
CentOS 7.1
Linux 3.18.9
1 x Intel E5-2640
64 GB RAM
40 Gb Intel NIC bonded with LACP using jumbo frames
10 x Toshiba MG03ACA400 4 TB 7200 RPM drives
2 x Intel SSDSC2BB240G4 240GB SSD
1 x 32 GB SATADOM for OS

The KVM node is:
CentOS 6.6
Linux 3.12.39
QEMU v0.12.1.2-2 cache mode none

The VM is:
CentOS 6.6
Linux 2.6.32-504
fio v2.1.10

On average, preloading Ceph with either tcmalloc or jemalloc showed a
performance increase of about 30%, with most of the gains for smaller
I/O. Although preloading QEMU with jemalloc provided about a 6%
increase on a lightly loaded server, it made no noticeable difference
when combined with Ceph using either tcmalloc or jemalloc.
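
For readers unfamiliar with preloading: on Linux the dynamic loader
honors the LD_PRELOAD environment variable, so an allocator can be
swapped in without rebuilding the binary. The message does not say
exactly how the preloading was done, so the following is only a minimal
sketch. The library paths and the helper name run_with_allocator are
assumptions for illustration; real paths vary by distribution.

```shell
#!/bin/sh
# Sketch only: these library paths and the helper name are assumptions,
# not taken from the test setup described in this message.
JEMALLOC_LIB=${JEMALLOC_LIB:-/usr/lib64/libjemalloc.so.1}
TCMALLOC_LIB=${TCMALLOC_LIB:-/usr/lib64/libtcmalloc.so.4}

# run_with_allocator NAME COMMAND [ARGS...]
# Runs COMMAND with the chosen allocator preloaded. "default" leaves the
# environment untouched, so the binary uses whatever allocator it was
# linked against.
run_with_allocator() {
  alloc=$1
  shift
  case $alloc in
    jemalloc) LD_PRELOAD=$JEMALLOC_LIB "$@" ;;
    tcmalloc) LD_PRELOAD=$TCMALLOC_LIB "$@" ;;
    default)  "$@" ;;
    *) echo "unknown allocator: $alloc" >&2; return 2 ;;
  esac
}
```

Because the variable is set only for the one command, the same shell
can start a baseline run and a preloaded run back to back without the
setting leaking between them.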

Compiling Ceph entirely with jemalloc had a negative performance
impact overall. This may be due to dynamically linking to RocksDB
instead of the default static linking.

Preloading QEMU with tcmalloc showed very negative results overall;
however, it showed the most improvement of any test in the 1MB tests,
up to almost 2.5x the performance of the baseline. If your workload is
guaranteed to be 1MB I/O (and possibly larger), then this option may
be useful.

Based on the architecture of jemalloc, it is possible that loading it
on the QEMU host may provide more benefit on servers that are closer
to memory capacity, but I did not test this scenario.

Any feedback regarding this exercise is welcome.

Data: https://docs.google.com/a/leblancnet.us/spreadsheets/d/1n12IqAOuH2wH-A7Sq5boU8kSEYg_Pl20sPmM0idjj00/edit?usp=sharing
The test script is multitest. The real world test is based on the disk
stats of about 100 of our servers, which have uptimes of many months.

- - ----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVZgGRCRDmVDuy+mK58QAAM20QAJh0rR0NIQABCkMjiluP
f/mcIiy4MQfFd5RJ9/ZlMRDQ0KDwW7haRm58QE0S/l6ZZ3+z7MqsQOW8KHJE
Y75YjEdsl7zrLLcB4wNnUKJXZrPwzFReTtLbXsNB8h73tbzaLp3y9711gbNf
EQQujiSp5XDiOK+d+H0FVGp4AfVmFvlO5gjQMSUcUt58qN6BsnD8NbRLEvKf
S2WzvJjFO7g1HqWr5QssKGb+1rvze2Z2xByURU8yKVpdX59EIhfzPdgadp/n
AJGR2pXWGgW2CQ3ce7gN7cr32cjjWbmzpdr0djgVB5/Y1ERU8FvwNFIwFa6N
eFUKCohW5UjMw8CcO9CzUQtQxgKnqeHcyVe6Loamd2eZ+epIupFLI3lQF6NU
GSdBV/8Ale1SJuhShY6QnEJFav8nLTvNvlDF/NiBoSUMtnsl5fDTpLH3KA2w
o8sT2dcDEJEc9+kzUrugUBElinjOacFcINU3osYZJ0NNi4t1PDtPTUiWChvT
jZdpWVGVpxZ3w46csACJZxY0lP/Kd6JoSH+78q7wNivCHeHT7c3uy8KGbKA7
fecFaHBAsCYliX1tDN/abZFVhEvdb8AuTGqGkZ7xHj0PAUyddObYGjkStVUw
dGOH+nurnFZ5Qqct/gvcbxggbOTGunHLGwtALT5EAtTB1ThlfpVQImy5vKl0
aOER
=YTTi
-----END PGP SIGNATURE-----

[-- Attachment #2: multitest --]
[-- Type: application/octet-stream, Size: 13894 bytes --]

#!/usr/bin/perl
#########################################################
#  multitest, by Marcus Sorensen, BetterServers Inc     #
#  modified by David Collins and Robert LeBlanc, EIG    #
#  Licensed under the Open Software License version 3.0 #
#  http://opensource.org/licenses/OSL-3.0               #
#########################################################
use strict;
use Data::Dumper;

$| = 1;
my $colors = { red => "\e[1;31m", def => "\e[0m", green => "\e[1;32m", cyan => "\e[1;36m" };
my $restbetweentests = 15;
my $testtime = 300;   #seconds
my $testsize = "12500MB";
my $testjobs = 8;
my $testiodepth = 8;
my $testname = "multiiotester";
my %final_out;

unless ( `which fio 2>/dev/null`) {
  print "No executable 'fio' found in path, exiting\n";
  exit 1;   # non-zero status so callers can detect the failure
}

print <<EOF;
$colors->{red}
Multiple IO Tester$colors->{def}

  This application emulates a busy server in several states by launching multiple
threads that do various types of IO. This allows us to see what the consequences
are of running in a multitasking environment. This test uses direct IO and 
invalidates caches between tests, testing the disk, not the memory.

$colors->{red}NOTE:$colors->{def} You need at least 100GB of free space in your current working directory.

The following tests currently consist of:

  8 sequential readers
  8 sequential writers
  8 mixed sequential readers/writers (random choice per IO)
  8 random readers
  8 random writers
  8 mixed random readers/writers (random choice per IO)
  A real world simulation of varied read/write requests of various sizes weighted to smaller I/O and 72% read 28% write.

Feel free to modify the script to meet your needs. Enjoy!

The test should take less than 3 hours. Press <ENTER> to begin...
EOF
<STDIN>;

my $tests = { 'read-1024k'      => { 'order' => 1, 
                               'block' => '1024k', 
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'sequential read' }, 
              'write-1024k'     => { 'order' => 2, 
                               'block' => '1024k', 
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'sequential write' }, 
              'rw-1024k'        => { 'order' => 3, 
                               'block' => '1024k', 
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'seq read/seq write' },

              'read-256k'      => { 'order' => 1,
                               'block' => '256k',
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'sequential read' },
              'write-256k'     => { 'order' => 2,
                               'block' => '256k',
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'sequential write' },
              'rw-256k'        => { 'order' => 3,
                               'block' => '256k',
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'seq read/seq write' },


              'read-64k'      => { 'order' => 1,
                               'block' => '64k',
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'sequential read' },
              'write-64k'     => { 'order' => 2,
                               'block' => '64k',
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'sequential write' },
              'rw-64k'        => { 'order' => 3,
                               'block' => '64k',
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'seq read/seq write' },



              'read-16k'      => { 'order' => 1,
                               'block' => '16k',
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'sequential read' },
              'write-16k'     => { 'order' => 2,
                               'block' => '16k',
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'sequential write' },
              'rw-16k'        => { 'order' => 3,
                               'block' => '16k',
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'seq read/seq write' },



              'read-4k'      => { 'order' => 1,
                               'block' => '4k',
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'sequential read' },
              'write-4k'     => { 'order' => 2,
                               'block' => '4k',
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'sequential write' },
              'rw-4k'        => { 'order' => 3,
                               'block' => '4k',
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'seq read/seq write' },




              'randread-4k'  => { 'order' => 4, 
                               'block' => '4k', 
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'random read' }, 
              'randwrite-4k' => { 'order' => 5, 
                               'block' => '4k', 
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'random write' } , 
              'randrw-4k'    => { 'order' => 6, 
                               'block' => '4k', 
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'rand read/rand write' },


              'randread-16k'  => { 'order' => 4, 
                               'block' => '16k', 
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'random read' }, 
              'randwrite-16k' => { 'order' => 5, 
                               'block' => '16k', 
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'random write' } , 
              'randrw-16k'    => { 'order' => 6, 
                               'block' => '16k', 
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'rand read/rand write' },


              'randread-64k'  => { 'order' => 4, 
                               'block' => '64k', 
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'random read' }, 
              'randwrite-64k' => { 'order' => 5, 
                               'block' => '64k', 
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'random write' } , 
              'randrw-64k'    => { 'order' => 6, 
                               'block' => '64k', 
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'rand read/rand write' },



              'randread-256k'  => { 'order' => 4,
                               'block' => '256k',
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'random read' },
              'randwrite-256k' => { 'order' => 5,
                               'block' => '256k',
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'random write' } ,
              'randrw-256k'    => { 'order' => 6,
                               'block' => '256k',
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'rand read/rand write' },



              'randread-1024k'  => { 'order' => 4,
                               'block' => '1024k',
                               'output' => { 'multiiotester'=>'4', '2'=>'5', '3'=>'6' },
                               'name' => 'random read' },
              'randwrite-1024k' => { 'order' => 5,
                               'block' => '1024k',
                               'output' => { 'multiiotester'=>'20', '2'=>'25', '3'=>'47' },
                               'name' => 'random write' } ,
              'randrw-1024k'    => { 'order' => 6,
                               'block' => '1024k',
                               'output' => { 'multiiotester'=>'4,20', '2'=>'5,25', '3'=>'6,47' },
                               'name' => 'rand read/rand write' },



            };

mkdir('./multiiotester') if ! -d './multiiotester';
chdir('./multiiotester') or die "unable to chdir to test directory: $^E";


foreach my $t ( sort { $tests->{$a}->{order} <=> $tests->{$b}->{order} or $a cmp $b } keys %{$tests} ) {
  print "$colors->{cyan} running IO \"$tests->{$t}->{name} ($t)\" test... $colors->{def}\n";


  # Enable 'next' for testing
  if ( $t !~ /^read\-\d{2}k/ ) {
    #next;
  }

  my $testtype = $t;
  $testtype =~ s/\-.+//;

  my $cmd = "fio --direct=1 --invalidate=1 --ioengine=libaio --iodepth=$testiodepth --thread --time_based --runtime=$testtime --rw=$testtype --bs=$tests->{$t}->{block} --size=$testsize --numjobs=$testjobs --name=$testname --minimal | grep ';'";
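  my @output = `$cmd`;
  # The first terse field selects the column layout below: it is matched
  # against the 'output' keys (the job name, or terse version "2" or "3").
  $output[0] =~ /^(.*?);/;
  my $version = $1;
  my $data;
  my $iop_data;

  foreach my $d (@output){
    next unless $d =~ /;/;
    # $field holds the rate column index (or "read,write" pair);
    # the matching IOPS value sits in the next column.
    my $field = $tests->{$t}->{output}->{$version};
    my @items = split(";",$d);
    if ($field =~ /(\d+),(\d+)/) {
      $data .= "$items[$1];$items[$2]\n";
      $iop_data .= "$items[$1+1];$items[$2+1]\n";
    } else {
      $data .= "$items[$field]\n";
      $iop_data .= "$items[$field+1]\n";
    }
  }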

  my @results = split(/;/,combinejobs($data));
  my @iops = split(/;/,combinejobs($iop_data));

  print "\tresult is $colors->{green}" . join("$colors->{def}/$colors->{green}", map { convert($_) } @results) . "$colors->{def} per second\n";
  print "\tequals $colors->{green}" . join("$colors->{def}/$colors->{green}", @iops) . "$colors->{def} IOs per second\n\n";
  $final_out{$t}{'iops'} = \@iops;
  $final_out{$t}{'rate'} = \@results;


  sleep $restbetweentests;
}

print "$colors->{cyan} running IO \"Real World Test (real)\" test... $colors->{def}\n";

my $cmd = "fio --name $testname --rw randrw --bssplit 4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1 --ioengine libaio --iodepth $testiodepth --numjobs $testjobs --direct 1 --rwmixread 72 --norandommap --minimal --size=$testsize --runtime=$testtime --time_based --thread | grep ';'";
my @output = `$cmd`;
$output[0] =~ /^(.*?);/;
my $version = $1;
my $data;
my $iop_data;

foreach my $d (@output){
  next unless $d =~ /;/;
  # Same columns the loop above uses for terse version 3:
  # 6/47 are the read/write rates, 7/48 the read/write IOPS.
  my $field = '6,47';
  my $iop_field = '7,48';
  my @items = split(";",$d);
  if ($field =~ /(\d+),(\d+)/) {
    $data .= "$items[$1];$items[$2]\n";
  } else {
    $data .= "$items[$field]\n";
  }
  if ($iop_field =~ /(\d+),(\d+)/) {
    $iop_data .= "$items[$1];$items[$2]\n";
  } else {
    $iop_data .= "$items[$iop_field]\n";
  }
}

my @results = split(/;/,combinejobs($data));
my @iops = split(/;/,combinejobs($iop_data));

print "\tresult is $colors->{green}" . join("$colors->{def}/$colors->{green}", map { convert($_) } @results) . "$colors->{def} per second\n";
print "\tequals $colors->{green}" . join("$colors->{def}/$colors->{green}", @iops) . "$colors->{def} IOs per second\n\n";
$final_out{'real'}{'iops'} = \@iops;
$final_out{'real'}{'rate'} = \@results;


#print "cleaning up files..\n";

#unlink glob "multiiotester*";
#chdir("..");
#rmdir("multiiotester") or print "unable to delete directory 'multiiotester'\n";

###########################
####### subroutines #######
###########################

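sub convert {
  my $val = shift;
  my @units = ('KB','MB','GB');
  my $i = 0;

  # Divide by 1024 while the integer part has more than three digits
  # and a larger unit is still available.
  while ($i < $#units && $val =~ /^(\d+)/ && length($1) > 3) {
    $val = sprintf("%.2f",$val / 1024);
    $i++;
  }
  return $val . $units[$i];
}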

#sub toiops {
#   my $val = shift;
#   my $blocksize = shift;
# 
#   $blocksize =~ s/k//;
#   my $io = sprintf("%.1f",$val/$blocksize);
#  
#   return $io;
# }

sub combinejobs {
  my $input = shift;
  
  my @lines = split(/\n/,$input);
  my @output = ();

  foreach my $l (0..$#lines) {
    my @temp = split(/;/,$lines[$l]);
    foreach my $t (0..$#temp){
      $output[$t] += $temp[$t];
    }
  }

  return join(";",@output);
}


print "\n\n\n#########################################\n\n";
my $header;
my $csv;
for my $test ( sort keys %final_out ) {
  if ( scalar @{$final_out{$test}{'rate'}} == 1 ) {
    $header .= "$test rate,$test IOPs,";
    $csv .= "$final_out{$test}{'rate'}[0],$final_out{$test}{'iops'}[0],";
  } else {
    $header .= "$test read rate,$test read IOPs,$test write rate,$test write IOPs,";
    $csv .= "$final_out{$test}{'rate'}[0],$final_out{$test}{'iops'}[0],$final_out{$test}{'rate'}[1],$final_out{$test}{'iops'}[1],";
  }
}
chop($header);
chop($csv);
print "$header\n";
print "$csv\n";

print "\n\n#########################################\n\n";

* Re: Memory Allocators and Ceph
       [not found] ` <CAANLjFpErC4xbwgJgZGWFdMaWQ1Q4otBksyRqP0jfWKnqVacog-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-27 17:59   ` Haomai Wang
       [not found]     ` <CACJqLyZS5pVB8ULCc7CNemtd1qRhkfz_mvOS0RRdbiHFbiQn6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-27 20:06   ` Mark Nelson
  1 sibling, 1 reply; 7+ messages in thread
From: Haomai Wang @ 2015-05-27 17:59 UTC (permalink / raw)
  To: Robert LeBlanc
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel

Really cool!!!

It's really important work that helps us see how much difference the
memory allocation library makes.

Recently I did some basic work and wanted to investigate the memory
allocation characteristics of Ceph workloads, but I hesitated because
it was unclear how much improvement was possible. Right now the top CPU
consumer is memory allocation/free, and I see that workloads with
different I/O sizes (and high CPU usage) result in terrible performance
for a Ceph cluster. I hope that by solving this problem we can lower the
CPU requirements of Ceph (for fast storage device backends).

BTW, could you share the details of your workload?

-- 
Best Regards,

Wheat

* Re: Memory Allocators and Ceph
       [not found]     ` <CACJqLyZS5pVB8ULCc7CNemtd1qRhkfz_mvOS0RRdbiHFbiQn6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-27 18:12       ` Robert LeBlanc
  0 siblings, 0 replies; 7+ messages in thread
From: Robert LeBlanc @ 2015-05-27 18:12 UTC (permalink / raw)
  To: Haomai Wang
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

The workload is, on average, 17KB per read request and 13KB per write
request, with 73% read and 27% write. This is a web hosting workload.
- ----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVZgjYCRDmVDuy+mK58QAAZfIQAML2Z20JnZw9+sU2vnvr
oizfxb5TGuPwPNKaFYcbM3+gCmfYBFRIR87u/VWo/V/5Y1s0qLYcsZco+rmE
qj7KHGQ5FM22e2pX5dc7PqzlqIe8KP66hsRfqwGTgZkQJAIYn5O02TA8JXrh
Yrdu+4xttPFOy+WCEmlKYDsYhHJHwQ3dkeXIC0TRvMYABc+5j9W59qCa8fq7
QxSQs4HGhBYB6kRi6fX8pl0NZ675bsgEmBeJ7ZkbdKfpPycj9Py/SCIXJtEg
amVEO3ABZ89uIglUyOkCvK5Pakpx4Pd8nMfhQf2iXyfEPWHLYZ4w8i0UyJC2
880udQdghxdXm8Z9s9STD3IIHUjsC99ltfnp2zSWjnHAm+OMqxRVTmsD9z1a
6eyzNBRi55VuXMqbZpRuAnwiNGniucZLG1dTtQtTR14/56mDeJLt8gcG6rHM
Glfm7YHyB+JpU4MUSKSRSRs1qfyDigxmynniNC6G0qvQIDrL4UBL1/LMKKKd
CiBUCvK337PMbaePDDT3EZKe5YoZsbxQf/GGD4WB8BgkjF79JmHYkZarivYb
acthRIHF3Y/OzU133Tg3YXC3hVe0y42u2OmqBJzbPWytyw3FuIhR8KFBLx6O
qp2Mj6HXGLJ3LNvsYA1hAmimxsjR9AbGsMYFCnYbwPX4ZvD23gPk8lxEMBiP
VcVd
=bQPB
-----END PGP SIGNATURE-----

* Re: Memory Allocators and Ceph
       [not found] ` <CAANLjFpErC4xbwgJgZGWFdMaWQ1Q4otBksyRqP0jfWKnqVacog-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-05-27 17:59   ` Haomai Wang
@ 2015-05-27 20:06   ` Mark Nelson
       [not found]     ` <556623AB.9030804-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 7+ messages in thread
From: Mark Nelson @ 2015-05-27 20:06 UTC (permalink / raw)
  To: Robert LeBlanc,
	ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel

On 05/27/2015 12:40 PM, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> With all the talk of tcmalloc and jemalloc, I decided to do some
> testing of the different memory allocating technologies between KVM
> and Ceph. These tests were done on a pre-production system, so I've tried
> to remove some of the variance with many runs and averages. The details
> are as follows:
>
> Ceph v0.94.1 (I backported a branch from master to get full jemalloc
> support for part of the tests)
> tcmalloc v2.4-3
> jemalloc v3.6.0-1
> QEMU v0.12.1.2-2 (as I understand it, the latest version for RH6/CentOS6)
> OSDs are only spindles with SSD journals, no SSD tiering
>
> The 11 Ceph nodes are:
> CentOS 7.1
> Linux 3.18.9
> 1 x Intel E5-2640
> 64 GB RAM
> 40 Gb Intel NIC bonded with LACP using jumbo frames
> 10 x Toshiba MG03ACA400 4 TB 7200 RPM drives
> 2 x Intel SSDSC2BB240G4 240GB SSD
> 1 x 32 GB SATADOM for OS
>
> The KVM node is:
> CentOS 6.6
> Linux 3.12.39
> QEMU v0.12.1.2-2 cache mode none
>
> The VM is:
> CentOS 6.6
> Linux 2.6.32-504
> fio v2.1.10
>
> On average, preloading Ceph with either tcmalloc or jemalloc showed a
> performance increase of about 30%, with most of the gains for smaller
> I/O. Although preloading QEMU with jemalloc provided about a 6%
> increase on a lightly loaded server, it made no noticeable difference
> when combined with Ceph using either tcmalloc or jemalloc.

Very interesting tests Robert!

>
> Compiling Ceph entirely with jemalloc had a negative performance
> impact overall. This may be due to dynamically linking to RocksDB
> instead of the default static linking.

Is it possible that there were any other differences?  A 30% gain 
turning into a 30% loss with pre-loading vs compiling seems pretty crazy!

>
> Preloading QEMU with tcmalloc showed very negative results overall;
> however, it showed the most improvement of any test in the 1MB tests,
> up to almost 2.5x the performance of the baseline. If your workload is
> guaranteed to be 1MB I/O (and possibly larger), then this option may
> be useful.
>
> Based on the architecture of jemalloc, it is possible that loading it
> on the QEMU host may provide more benefit on servers that are closer
> to memory capacity, but I did not test this scenario.
>
> Any feedback regarding this exercise is welcome.

Might be worth trying to reproduce the results and grab perf data or 
some other kind of trace data during the tests.  There's so much 
variability here it's really tough to get an idea of why the performance 
swings so dramatically.
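
One possible way to collect such data, as a minimal sketch: sample call
graphs from a single OSD process for a fixed time while a test runs. The
helper name profile_cmd and the file names are hypothetical; the helper
only prints the perf command so it can be reviewed before being run.

```shell
#!/bin/sh
# profile_cmd PID SECONDS OUTFILE
# Prints (does not run) a perf invocation that records call graphs from
# one process for SECONDS seconds and writes the samples to OUTFILE.
profile_cmd() {
  printf 'perf record -g -p %s -o %s -- sleep %s\n' "$1" "$3" "$2"
}

profile_cmd 1234 60 osd.perf.data
```

The recorded file can then be inspected with `perf report -i
osd.perf.data`; comparing reports from a preloaded run and a compiled-in
run may show where the allocator time goes.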

Still, excellent testing!  We definitely need more of this so we can 
determine if jemalloc is something that would be worth switching to 
eventually.

>
> Data: https://docs.google.com/a/leblancnet.us/spreadsheets/d/1n12IqAOuH2wH-A7Sq5boU8kSEYg_Pl20sPmM0idjj00/edit?usp=sharing
> Test script is multitest. The real world test is based off of the disk
> stats of about 100 of our servers which have uptimes of many months.
>
> - - ----------------
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v0.13.1
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJVZgGRCRDmVDuy+mK58QAAM20QAJh0rR0NIQABCkMjiluP
> f/mcIiy4MQfFd5RJ9/ZlMRDQ0KDwW7haRm58QE0S/l6ZZ3+z7MqsQOW8KHJE
> Y75YjEdsl7zrLLcB4wNnUKJXZrPwzFReTtLbXsNB8h73tbzaLp3y9711gbNf
> EQQujiSp5XDiOK+d+H0FVGp4AfVmFvlO5gjQMSUcUt58qN6BsnD8NbRLEvKf
> S2WzvJjFO7g1HqWr5QssKGb+1rvze2Z2xByURU8yKVpdX59EIhfzPdgadp/n
> AJGR2pXWGgW2CQ3ce7gN7cr32cjjWbmzpdr0djgVB5/Y1ERU8FvwNFIwFa6N
> eFUKCohW5UjMw8CcO9CzUQtQxgKnqeHcyVe6Loamd2eZ+epIupFLI3lQF6NU
> GSdBV/8Ale1SJuhShY6QnEJFav8nLTvNvlDF/NiBoSUMtnsl5fDTpLH3KA2w
> o8sT2dcDEJEc9+kzUrugUBElinjOacFcINU3osYZJ0NNi4t1PDtPTUiWChvT
> jZdpWVGVpxZ3w46csACJZxY0lP/Kd6JoSH+78q7wNivCHeHT7c3uy8KGbKA7
> fecFaHBAsCYliX1tDN/abZFVhEvdb8AuTGqGkZ7xHj0PAUyddObYGjkStVUw
> dGOH+nurnFZ5Qqct/gvcbxggbOTGunHLGwtALT5EAtTB1ThlfpVQImy5vKl0
> aOER
> =YTTi
> -----END PGP SIGNATURE-----
>

* Re: Memory Allocators and Ceph
       [not found]     ` <556623AB.9030804-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-27 21:00       ` Robert LeBlanc
       [not found]         ` <CAANLjFr=f=o4_2admJ9rxdxrB5XBcDy8i2mYzVtEYP_mFZb_Aw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Robert LeBlanc @ 2015-05-27 21:00 UTC (permalink / raw)
  To: Mark Nelson
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On Wed, May 27, 2015 at 2:06 PM, Mark Nelson wrote:
>> Compiling Ceph entirely with jemalloc overall had a negative
>> performance impact. This may be due to dynamically linking to RocksDB
>> instead of the default static linking.
>
>
> Is it possible that there were any other differences?  A 30% gain turning
> into a 30% loss with pre-loading vs compiling seems pretty crazy!

I tried hard to minimize the differences by backporting the Ceph
jemalloc feature into 0.94.1, the version used in the other testing. I
did have to get RocksDB from master to get it to compile against
jemalloc, so there is some difference there. When preloading Ceph with
jemalloc, parts of Ceph still used tcmalloc because RocksDB was
statically linked against it, so both allocators were in use during
those tests. Programming is not my forte, so it is quite possible that
I botched something with that test.
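
For anyone reproducing the preloaded case, a minimal sketch of one way
to see which allocator libraries a process has mapped. The helper name
allocators_in_maps, the library path, and the pidof lookup are all
assumptions for illustration, not part of the original test setup. An
allocator that is statically linked into a binary has no mapping of its
own, so it will not show up here.

```shell
#!/bin/sh
# Hypothetical helper: list allocator libraries named in a
# /proc/<pid>/maps-style file. Statically linked allocators have no
# separate mapping and therefore are not reported.
allocators_in_maps() {
    maps_file="$1"
    # Extract the library family from mapped paths such as
    # /usr/lib64/libjemalloc.so.1 or /usr/lib64/libtcmalloc.so.4,
    # then collapse repeated mappings of the same library.
    grep -oE 'lib(tcmalloc|jemalloc)' "$maps_file" | sort -u
}

# Example use (library path and OSD id are assumptions about the host):
#   LD_PRELOAD=/usr/lib64/libjemalloc.so.1 ceph-osd -i 0
#   allocators_in_maps "/proc/$(pidof -s ceph-osd)/maps"
```

If both libtcmalloc and libjemalloc are listed for the same process,
both are at least loaded, though this alone does not show which one is
serving a given allocation.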

The goal of the test was to see if and where these allocators may
help/hinder performance. It could also provide some feedback to Ceph
devs on how to leverage one or the other or both. I don't consider
this test to be extremely reliable as there is some variability in
this pre-production system even though I tried to remove the
variability to an extent.

I hope others can build on this as a jumping off point and at least
have some interesting places to look instead of having to scope out a
large section of the space.

> Might be worth trying to reproduce the results and grab perf data or some
> other kind of trace data during the tests.  There's so much variability here
> it's really tough to get an idea of why the performance swings so
> dramatically.

I'm not very familiar with the perf tools (can you use them with
jemalloc?) or with what data would be useful. If you tell me which
configurations and tests you are interested in and how you want perf
to generate the data, I can see what I can do to provide that. Each
test suite takes about 9 hours to run, so it is pretty intensive.

Each "sub-test" (e.g. 4K seq read) takes 5 minutes, so it is much
easier to run a selection of those if there are specific tests you are
interested in. I'm happy to provide data, but given how long these
tests take to run, focusing on specific areas would provide
data/benefits much faster.
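
As a rough illustration of running one such sub-test on its own, here
is a sketch of a single 5-minute 4K sequential read expressed as an fio
command. The multitest script itself is not shown in this thread, so
the function name run_subtest, the target device, and the direct I/O,
ioengine and iodepth settings are assumptions. Only the pattern, block
size and 5-minute runtime come from the description above.

```shell
#!/bin/sh
# Print (dry run, the default) or execute one fio sub-test.
# Usage: run_subtest <rw-pattern> <block-size> <target-device-or-file>
run_subtest() {
    rw="$1"; bs="$2"; target="$3"
    # 300 s matches the 5-minute sub-test length. --direct, --ioengine
    # and --iodepth are assumed values, not taken from multitest.
    set -- fio --name="${rw}-${bs}" --filename="$target" \
        --rw="$rw" --bs="$bs" --runtime=300 --time_based \
        --direct=1 --ioengine=libaio --iodepth=16
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"          # show the command without touching the disk
    else
        "$@"               # actually run fio
    fi
}

# 4K sequential read against a hypothetical test device inside the VM:
run_subtest read 4k /dev/vdb
```

Setting DRY_RUN=0 runs fio for real; the default only prints the
command so it can be reviewed first.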

>
> Still, excellent testing!  We definitely need more of this so we can
> determine if jemalloc is something that would be worth switching to
> eventually.
>
>

- ----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVZjCACRDmVDuy+mK58QAAsHIQAImJWLkGix2sDKCZgcME
0RHmelyEBtFFjIUNJvrwC0PvUKqQ/sffdtC+QLLcFYKOO2G5lrojKhCdwhXI
OP0O1IqMcXUCBcq5yNJf8O6uzQ56Q4qCHWJmg49JRHx4gQLNK9VtGLRevL96
JNrwhllpI5v+ewuQR/P2uD/NAXhFWDjEXLO4xHQGylOQOOVRQBlWeq+3QLqX
4Zz+yiY4VIdhSe/z3aQYxes12snyjF2zP2Zo/BS47KBtVbmOJ7wGBGIFy8nw
T4r7HYapCX3sqAN/fHEvwgcunYaW4y8aZT2a3Lv0PZKz23d6zcOUBPEFJ86W
DzZyqqmDq7QJLtUnAb1yyQj23bWntI/zoT83zWCUvPHU7odmlBvSWZ8w7ToC
mpOYjPw5CBVvztCFM2gwnmEXdM0qtmtdv/NhfQVu5+FNhQDSlhOPMCXdM3wf
2JjuygdfRg4kGE6KyX4nYSZxfacsvX3SIkLnKYsdeWMNMZwGC6TvulApY61s
sedwbe+hyFqlfGlbM+QCtW495Wr9EcfFdM/PWUDkXtfmfE20UdqAKYzIeJfC
F8HS5sZz6GtiLb1Dbiq69hNmUUtfDEIDVssARKbMtmZ30bPdNe42grBttzDG
3aNc05TwFe72HMjAhtvQrkrq1C+4XZA3mpNnosiXCUJT9WeOAOJbzWQS0mUS
Yrtb
=+ESo
-----END PGP SIGNATURE-----

* Re: Memory Allocators and Ceph
       [not found]         ` <CAANLjFr=f=o4_2admJ9rxdxrB5XBcDy8i2mYzVtEYP_mFZb_Aw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-05-27 21:48           ` Mark Nelson
       [not found]             ` <55663BB0.7090500-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Nelson @ 2015-05-27 21:48 UTC (permalink / raw)
  To: Robert LeBlanc
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel



On 05/27/2015 04:00 PM, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
>
> On Wed, May 27, 2015 at 2:06 PM, Mark Nelson wrote:
>>> Compiling Ceph entirely with jemalloc overall had a negative
>>> performance impact. This may be due to dynamically linking to RocksDB
>>> instead of the default static linking.
>>
>>
>> Is it possible that there were any other differences?  A 30% gain turning
>> into a 30% loss with pre-loading vs compiling seems pretty crazy!
>
> I tried hard to minimize the differences by backporting the Ceph
> jemalloc feature into 0.94.1, the version used in the other testing. I
> did have to get RocksDB from master to get it to compile against
> jemalloc, so there is some difference there. When preloading Ceph with
> jemalloc, parts of Ceph still used tcmalloc because RocksDB was
> statically linked against it, so both allocators were in use during
> those tests. Programming is not my forte, so it is quite possible that
> I botched something with that test.
>
> The goal of the test was to see if and where these allocators may
> help/hinder performance. It could also provide some feedback to Ceph
> devs on how to leverage one or the other or both. I don't consider
> this test to be extremely reliable as there is some variability in
> this pre-production system even though I tried to remove the
> variability to an extent.
>
> I hope others can build on this as a jumping off point and at least
> have some interesting places to look instead of having to scope out a
> large section of the space.
>
>
>> Might be worth trying to reproduce the results and grab perf data or some
>> other kind of trace data during the tests.  There's so much variability here
>> it's really tough to get an idea of why the performance swings so
>> dramatically.
>
> I'm not very familiar with the perf tools (can you use them with
> jemalloc?) or with what data would be useful. If you tell me which
> configurations and tests you are interested in and how you want perf
> to generate the data, I can see what I can do to provide that. Each
> test suite takes about 9 hours to run, so it is pretty intensive.

perf can give you a call graph showing how much CPU time is being spent 
in different parts of the code.

Something like this during the test:

sudo perf record --call-graph dwarf -F 99 -a
sudo perf report

You may need a newish kernel/OS for dwarf support to work.  There are 
probably other tools that may also give insights into what is going on.
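
If a system-wide capture turns out to be too noisy, a variation is to
sample a single process for a fixed window. The sketch below keeps the
same sampling options as the command above and adds -p, -o and a
bounding sleep. The function name perf_capture, the pid, the duration
and the output file name are placeholders, and root privileges are
likely needed when running it for real.

```shell
#!/bin/sh
# Print (dry run, the default) or execute a perf capture of one process
# for a fixed number of seconds.
# Usage: perf_capture <pid> <seconds> <output-file>
perf_capture() {
    pid="$1"; secs="$2"; out="$3"
    # Same call-graph and frequency options as the system-wide command,
    # but limited to one process (-p) and ended when sleep exits.
    set -- perf record --call-graph dwarf -F 99 -p "$pid" -o "$out" \
        -- sleep "$secs"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"          # show the command only
    else
        "$@"               # run the capture
    fi
}

# Hypothetical OSD pid, 60-second window:
perf_capture 1234 60 osd-jemalloc.data
# Afterwards, a text report could be produced with:
#   perf report -i osd-jemalloc.data --stdio
```

Capturing one file per configuration (preloaded vs compiled) under the
same sub-test makes the two reports easier to compare side by side.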

>
> Each "sub-test" (e.g. 4K seq read) takes 5 minutes, so it is much
> easier to run a selection of those if there are specific tests you are
> interested in. I'm happy to provide data, but given how long these
> tests take to run, focusing on specific areas would provide
> data/benefits much faster.

I guess starting out I'm interested in what's happening with preloaded 
vs compiled jemalloc.  Other tests might be interesting too though!

>
>>
>> Still, excellent testing!  We definitely need more of this so we can
>> determine if jemalloc is something that would be worth switching to
>> eventually.
>>
>>
>
>
> - ----------------
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v0.13.1
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJVZjCACRDmVDuy+mK58QAAsHIQAImJWLkGix2sDKCZgcME
> 0RHmelyEBtFFjIUNJvrwC0PvUKqQ/sffdtC+QLLcFYKOO2G5lrojKhCdwhXI
> OP0O1IqMcXUCBcq5yNJf8O6uzQ56Q4qCHWJmg49JRHx4gQLNK9VtGLRevL96
> JNrwhllpI5v+ewuQR/P2uD/NAXhFWDjEXLO4xHQGylOQOOVRQBlWeq+3QLqX
> 4Zz+yiY4VIdhSe/z3aQYxes12snyjF2zP2Zo/BS47KBtVbmOJ7wGBGIFy8nw
> T4r7HYapCX3sqAN/fHEvwgcunYaW4y8aZT2a3Lv0PZKz23d6zcOUBPEFJ86W
> DzZyqqmDq7QJLtUnAb1yyQj23bWntI/zoT83zWCUvPHU7odmlBvSWZ8w7ToC
> mpOYjPw5CBVvztCFM2gwnmEXdM0qtmtdv/NhfQVu5+FNhQDSlhOPMCXdM3wf
> 2JjuygdfRg4kGE6KyX4nYSZxfacsvX3SIkLnKYsdeWMNMZwGC6TvulApY61s
> sedwbe+hyFqlfGlbM+QCtW495Wr9EcfFdM/PWUDkXtfmfE20UdqAKYzIeJfC
> F8HS5sZz6GtiLb1Dbiq69hNmUUtfDEIDVssARKbMtmZ30bPdNe42grBttzDG
> 3aNc05TwFe72HMjAhtvQrkrq1C+4XZA3mpNnosiXCUJT9WeOAOJbzWQS0mUS
> Yrtb
> =+ESo
> -----END PGP SIGNATURE-----
>

* Re: Memory Allocators and Ceph
       [not found]             ` <55663BB0.7090500-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-05-28 15:54               ` Robert LeBlanc
  0 siblings, 0 replies; 7+ messages in thread
From: Robert LeBlanc @ 2015-05-28 15:54 UTC (permalink / raw)
  To: Mark Nelson
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I've got some more tests running right now. Once those are done, I'll
find a couple of tests that had extreme differences and gather some
perf data for them.
- ----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, May 27, 2015 at 3:48 PM, Mark Nelson wrote:
>
>
> On 05/27/2015 04:00 PM, Robert LeBlanc wrote:
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>>
>> On Wed, May 27, 2015 at 2:06 PM, Mark Nelson wrote:
>>>>
>>>> Compiling Ceph entirely with jemalloc overall had a negative
>>>> performance impact. This may be due to dynamically linking to RocksDB
>>>> instead of the default static linking.
>>>
>>>
>>>
>>> Is it possible that there were any other differences?  A 30% gain turning
>>> into a 30% loss with pre-loading vs compiling seems pretty crazy!
>>
>>
>> I tried hard to minimize the differences by backporting the Ceph
>> jemalloc feature into 0.94.1, the version used in the other testing. I
>> did have to get RocksDB from master to get it to compile against
>> jemalloc, so there is some difference there. When preloading Ceph with
>> jemalloc, parts of Ceph still used tcmalloc because RocksDB was
>> statically linked against it, so both allocators were in use during
>> those tests. Programming is not my forte, so it is quite possible that
>> I botched something with that test.
>>
>> The goal of the test was to see if and where these allocators may
>> help/hinder performance. It could also provide some feedback to Ceph
>> devs on how to leverage one or the other or both. I don't consider
>> this test to be extremely reliable as there is some variability in
>> this pre-production system even though I tried to remove the
>> variability to an extent.
>>
>> I hope others can build on this as a jumping off point and at least
>> have some interesting places to look instead of having to scope out a
>> large section of the space.
>>
>>
>>> Might be worth trying to reproduce the results and grab perf data or some
>>> other kind of trace data during the tests.  There's so much variability
>>> here
>>> it's really tough to get an idea of why the performance swings so
>>> dramatically.
>>
>>
>> I'm not very familiar with the perf tools (can you use them with
>> jemalloc?) and what would be useful. If you would like to tell me some
>> configurations and tests you are interested in and let me know how you
>> want perf to generate the data, I can see what I can do to provide
>> that. Each test suite takes about 9 hours to run so it is pretty
>> intensive.
>
>
> perf can give you a call graph showing how much cpu time is being spent in
> different parts of the code.
>
> Something like this during the test:
>
> sudo perf record --call-graph dwarf -F 99 -a
> sudo perf report
>
> You may need a newish kernel/os for dwarf support to work.  There are
> probably other tools that may also give insights into what is going on.
>
>>
>> Each "sub-test" (e.g. 4K seq read) takes 5 minutes, so it is much
>> easier to run a selection of those if there are specific tests you are
>> interested in. I'm happy to provide data, but given how long these
>> tests take to run, focusing on specific areas would provide
>> data/benefits much faster.
>
>
> I guess starting out I'm interested in what's happening with preloaded vs
> compiled jemalloc.  Other tests might be interesting too though!
>
>
>>
>>>
>>> Still, excellent testing!  We definitely need more of this so we can
>>> determine if jemalloc is something that would be worth switching to
>>> eventually.
>>>
>>>
>>
>>
>> - ----------------
>> Robert LeBlanc
>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v0.13.1
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJVZjCACRDmVDuy+mK58QAAsHIQAImJWLkGix2sDKCZgcME
>> 0RHmelyEBtFFjIUNJvrwC0PvUKqQ/sffdtC+QLLcFYKOO2G5lrojKhCdwhXI
>> OP0O1IqMcXUCBcq5yNJf8O6uzQ56Q4qCHWJmg49JRHx4gQLNK9VtGLRevL96
>> JNrwhllpI5v+ewuQR/P2uD/NAXhFWDjEXLO4xHQGylOQOOVRQBlWeq+3QLqX
>> 4Zz+yiY4VIdhSe/z3aQYxes12snyjF2zP2Zo/BS47KBtVbmOJ7wGBGIFy8nw
>> T4r7HYapCX3sqAN/fHEvwgcunYaW4y8aZT2a3Lv0PZKz23d6zcOUBPEFJ86W
>> DzZyqqmDq7QJLtUnAb1yyQj23bWntI/zoT83zWCUvPHU7odmlBvSWZ8w7ToC
>> mpOYjPw5CBVvztCFM2gwnmEXdM0qtmtdv/NhfQVu5+FNhQDSlhOPMCXdM3wf
>> 2JjuygdfRg4kGE6KyX4nYSZxfacsvX3SIkLnKYsdeWMNMZwGC6TvulApY61s
>> sedwbe+hyFqlfGlbM+QCtW495Wr9EcfFdM/PWUDkXtfmfE20UdqAKYzIeJfC
>> F8HS5sZz6GtiLb1Dbiq69hNmUUtfDEIDVssARKbMtmZ30bPdNe42grBttzDG
>> 3aNc05TwFe72HMjAhtvQrkrq1C+4XZA3mpNnosiXCUJT9WeOAOJbzWQS0mUS
>> Yrtb
>> =+ESo
>> -----END PGP SIGNATURE-----
>>
>

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVZzoiCRDmVDuy+mK58QAAwiIQALFexcUi7eeosd36JMPQ
ZfDKaeLkkZoftAtM3EYAZVfx2vdiUDeQKyecdhgFin2CGz68NFRRBjZZ9qll
USMyfk85X71XQh7cZplkFGc4fwKN2leUDJWbnbpB8PQa15ocj+wBOlfeFmTX
PCW0+fv06slo/uCPtJH0Drl978pU1MXrESYJwJaGcfK9IUgCGD/w+4rtGwt3
ITvEfdmDBwEmNErxFojBcQ1XTxbb5tDXMjwJ9acdg0mDg0PiKXGtu79fJrle
kouO2RyBYNfA5/w83Hy8IhFncI+9XO2NnCF4pGR6G35yhwNq6TuA67bPQ4ip
+fdkPvp+/v3YOpeB0iBkZJLSGQVTICbCEW3GQNT9lhZ31cc/tyWqMLh5Zdwq
r087wndLF/1LKOGG9M+LK44l1AJG0xKj8DQUgvP2/Nv6Mb9od+Nc0jFM0ysc
OFB7bhwk16Q6rNM0U/Zr6DvnhhTyrP7yMGEw3cGDKW9QHHYaHBl8hOlzVPUb
h5fgkciq4fhwCVNLWDvU0A5Bf/chhF842Zhws0BGSg8EJ/dKpaHyNiUXUWpS
SjcNQNssgHMLawE/YJFL5FOuJ9aNXLwBDvkofHKQ3oQkPelHfLEF9L2FWI7/
45wq3dZe5QRePWA1gQDfx3eUeBGNUEIBb9KBw+fvGV2uV3oFdnu2t7b59JlB
r7f6
=uLA0
-----END PGP SIGNATURE-----

end of thread (last message 2015-05-28 15:54 UTC)

Thread overview: 7+ messages
2015-05-27 17:40 Memory Allocators and Ceph Robert LeBlanc
     [not found] ` <CAANLjFpErC4xbwgJgZGWFdMaWQ1Q4otBksyRqP0jfWKnqVacog-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-27 17:59   ` Haomai Wang
     [not found]     ` <CACJqLyZS5pVB8ULCc7CNemtd1qRhkfz_mvOS0RRdbiHFbiQn6A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-27 18:12       ` Robert LeBlanc
2015-05-27 20:06   ` Mark Nelson
     [not found]     ` <556623AB.9030804-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-27 21:00       ` Robert LeBlanc
     [not found]         ` <CAANLjFr=f=o4_2admJ9rxdxrB5XBcDy8i2mYzVtEYP_mFZb_Aw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-27 21:48           ` Mark Nelson
     [not found]             ` <55663BB0.7090500-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-05-28 15:54               ` Robert LeBlanc
