On 04/13/2015 10:27 AM, Sage Weil wrote:
> [adding ceph-devel]
>
> On Mon, 13 Apr 2015, Chen, Xiaoxi wrote:
>> Hi,
>>
>>        Actually I have done the tuning survey on RocksDB when I was
>> updating the RocksDB to newer version and exposed the tuning in
>> ceph.conf.
>>
>>        What we need to ensure is that the WAL never hits the disk. The rocksdb
>
> We'll always have to pay that 1x write to the log; we just want to make
> sure it doesn't turn into 2x.  I take it you're assuming the log is on an
> SSD (not disk)?
>
>> write ahead log already introduces a 1X write; if the data is flushed to
>> an SST in level 0, that will be 2X, not to mention any further compaction.
>>
>>        The tunables that make the difference are:
>> 	write_buffer_size
>> 	max_write_buffer_number
>> 	min_write_buffer_number_to_merge
>>
>>        Say if we have
>> 	write_buffer_size = 512M
>> 	max_write_buffer_number = 6
>> 	min_write_buffer_number_to_merge = 2
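
For anyone following along, here is a rough back-of-the-envelope sketch of
what the quoted settings imply. It is only arithmetic on the three values
above, not anything taken from RocksDB or Ceph code. The function names and
the flushed_fraction parameter are made up for illustration, and the
write-amplification model is a simplification (1x for the WAL, plus 1x for
whatever fraction of the data ends up in a level 0 SST, ignoring compaction).

```python
# Illustrative arithmetic only; names below are hypothetical helpers,
# not RocksDB or Ceph APIs.
MiB = 1024 * 1024

write_buffer_size = 512 * MiB          # size of a single memtable
max_write_buffer_number = 6            # memtables allowed in memory at once
min_write_buffer_number_to_merge = 2   # memtables merged before a flush


def memtable_budget_bytes(buf_size, max_buffers):
    """Upper bound on memory held by memtables (per column family)."""
    return buf_size * max_buffers


def flush_batch_bytes(buf_size, min_merge):
    """Approximate amount of memtable data merged before one flush."""
    return buf_size * min_merge


def write_amplification(flushed_fraction):
    """1x for the WAL plus the fraction of data that reaches a level 0 SST.

    flushed_fraction is an assumed input in [0, 1], not a RocksDB option;
    compaction beyond level 0 is ignored.
    """
    return 1.0 + flushed_fraction


if __name__ == "__main__":
    budget = memtable_budget_bytes(write_buffer_size, max_write_buffer_number)
    batch = flush_batch_bytes(write_buffer_size, min_write_buffer_number_to_merge)
    print("memtable budget (MiB):", budget // MiB)
    print("merged per flush (MiB):", batch // MiB)
    print("write amp if everything is flushed:", write_amplification(1.0))
```

So with these values the memtables alone can hold up to 6 x 512M of data in
memory, which needs to be budgeted per OSD.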

Attached are tests for a single PCIe SSD with filestore, newstore +
fsync + default tunables, newstore + fsync + Xiaoxi's tunables, and also a
test using Xiaoxi's tunables with fdatasync.

Basically Xiaoxi's tunables help, and fdatasync helps a little more
(mostly at small IO sizes), but still not enough to get us to beat 
filestore, though newstore *does* do consistently better than filestore 
with 4MB writes now.

Mark