NewStore performance analysis


* NewStore performance analysis
@ 2015-04-20 14:59 Chen, Xiaoxi
  2015-04-20 15:39 ` Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-20 14:59 UTC (permalink / raw)
  To: Sage Weil, Mark Nelson, Somnath Roy, Chen, Xiaoxi
  Cc: Duan, Jiangang, Zhang, Jian, ceph-devel

[Resend in plain text]

Hi,
I have been experimenting with some RocksDB tunables these days, trying to optimize the performance of NewStore. From the data so far, the write amplification (WA) of RocksDB does not seem to be the issue blocking performance, and neither does the fragment part (aio/dio, etc.). The issue might be how many OPS RocksDB can offer under a 1-write-per-sync workload. I cannot find that number online, so I will measure it myself. If that number is low, we may need to hold multiple RocksDB instances in one OSD and do some sharding.

The RocksDB WAL, the RocksDB data files and the NewStore directory were backed by 3 separate SSDs.
/dev/sdc1      156172796    32928 156139868   1% /root/ceph-0-db
/dev/sdd1      195264572    32928 195231644   1% /root/ceph-0-db-wal
/dev/sdb1      156172796 10589552 145583244   7% /var/lib/ceph/osd/ceph-0

Some interesting findings:

1. The average request size on sdb (the NewStore FS part) is 2KB (avgrq-sz = 4.00 sectors), which is half of the request block size (4KB). IOPS in iostat (~2K) is about 2X the number reported by FIO. Bandwidth matches.

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00   58.33     0.00    28.98  1017.51     6.33  108.55    0.00  108.55   1.30   7.60
sdb               0.00     0.00    0.00 2038.00     0.00     3.98     4.00     0.13    0.07    0.00    0.07   0.07  13.33
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00   747.67    0.00 2099.67     0.00    11.28    11.00     0.76    0.36    0.00    0.36   0.36  75.73  

I believe NewStore does not split requests, so there must be some very small IO (~0KB) that accompanies each data write (4KB). Where does the small IO come from?

I also checked the FileStore data: this behavior is not present in FileStore, and changing the WBThrottle affects the number. So it seems this behavior is related to the flushing mechanism? In NewStore we are doing fdatasync more aggressively.
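
The request-size arithmetic in finding 1 can be checked directly against the iostat sample. A minimal Python sketch, assuming avgrq-sz is reported in 512-byte sectors and that wMB/s means MiB per second; the function names are made up for illustration:

```python
SECTOR_BYTES = 512  # assumption: iostat avgrq-sz is in 512-byte sectors


def avgrq_kib(avgrq_sz_sectors):
    """Convert iostat's avgrq-sz (sectors) to KiB per request."""
    return avgrq_sz_sectors * SECTOR_BYTES / 1024


def implied_kib_per_write(w_per_s, w_mb_per_s):
    """Cross-check: average write size implied by wMB/s divided by w/s."""
    return w_mb_per_s * 1024 / w_per_s


# Values copied from the sdb and sdd rows of the iostat sample.
sdb_kib = avgrq_kib(4.00)
sdb_implied = implied_kib_per_write(2038.00, 3.98)
sdd_kib = avgrq_kib(11.00)
sdd_implied = implied_kib_per_write(2099.67, 11.28)

print(f"sdb: {sdb_kib:.2f} KiB/request, implied {sdb_implied:.2f} KiB")
print(f"sdd: {sdd_kib:.2f} KiB/request, implied {sdd_implied:.2f} KiB")
```

Both columns agree at roughly 2 KiB per write on sdb, which is what one would expect if each 4KB client write were accompanied by one much smaller IO.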

2. Notice that by tuning write_buffer_size, write_buffer_num and min_write_buffer_number_to_merge, we can reduce the DB writes to ZERO.

Look at the iostat of sdc: almost no IO happens there, because most of the WAL entries are merged before being flushed to Level0.

The other RocksDB tunings were originally intended to optimize the compaction behavior, but since very little data is written to Level0, compaction is almost unmeasurable here.

3. Disabling the RocksDB WAL can 3X the performance (although this is definitely the WRONG WAY).

I was just curious what the performance would look like if no extra IO happened on the DB side.
With the RocksDB WAL turned off, the performance is 3x (799 -> 2464, latency from 10 -> 3.2).
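
The two numbers in finding 3 can be cross-checked with Little's law (requests in flight = throughput x latency). This sketch assumes that 799 and 2464 are IOPS and that 10 and 3.2 are latencies in milliseconds; the message does not state the units, so both are assumptions:

```python
def outstanding_requests(iops, latency_ms):
    """Little's law: mean requests in flight = throughput * latency."""
    return iops * latency_ms / 1000.0


# Assumed units: IOPS and milliseconds (not stated in the message).
with_wal = outstanding_requests(799, 10)
without_wal = outstanding_requests(2464, 3.2)
speedup = 2464 / 799

print(f"in flight with WAL:    {with_wal:.2f}")
print(f"in flight without WAL: {without_wal:.2f}")
print(f"throughput ratio:      {speedup:.2f}x")
```

Under those assumptions both cases come out close to 8 requests in flight, which would match the FIO iodepth=8: the client queue stays full and each request simply completes faster without the WAL.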

4. The average queue size is <1 in every case, for both the DB WAL part and the fragment part.

I guess there is some lock in rocksdb::WriteBatch() that prevents multiple OSD_OP_THREADs from working concurrently, but I have not analyzed this carefully.

An easy way to measure this might be to comment out db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see whether we can get more QD in the fragment part without issuing to the DB.

---------- Configurations ----------
My setup is SSD based: 1 OSD, and a pool with 100 PGs and size = 1. The pattern I am working on is 4KB random write (QD=8) on top of RBD (using fio-librbd). The FIO configuration is:
    bs=4k
    iodepth=8
    size=10g
    iodepth_batch_submit=1
    iodepth_batch_complete=1

The tunings I am using are listed here. They might not be the best, but they already show something.
    rocksdb_stats_dump_period_sec = 5
    rocksdb_max_background_compactions = 4
    rocksdb_compaction_threads = 4
    rocksdb_write_buffer_size = 536870912  //512MB
    rocksdb_write_buffer_num = 4
    rocksdb_min_write_buffer_number_to_merge = 2
    rocksdb_level0_file_num_compaction_trigger = 4
    rocksdb_max_bytes_for_level_base = 104857600 //100MB
    rocksdb_target_file_size_base = 10485760      //10MB
    rocksdb_num_levels = 3 // So the MAX_DB_SIZE would be ~10GB(100MB* 10^3), fair enough.
    rocksdb_compression = none

Xiaoxi
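
As a rough check on finding 2, the memtable budget implied by the write-buffer settings above can be computed directly. This sketch assumes that rocksdb_write_buffer_num corresponds to RocksDB's max_write_buffer_number option, so the product is an upper bound on memtable memory for one column family:

```python
write_buffer_size = 536870912  # rocksdb_write_buffer_size, in bytes
write_buffer_num = 4           # assumed to map to max_write_buffer_number
min_to_merge = 2               # rocksdb_min_write_buffer_number_to_merge

GIB = 1024 ** 3
MIB = 1024 ** 2

max_memtable_bytes = write_buffer_size * write_buffer_num
merged_flush_bytes = write_buffer_size * min_to_merge

print(f"max memtable memory: {max_memtable_bytes / GIB:.1f} GiB")
print(f"data merged per flush: up to {merged_flush_bytes / MIB:.0f} MiB")
```

With up to 1 GiB of memtable contents merged before each flush, it seems plausible that short-lived WAL keys are overwritten or deleted in memory first, which would be consistent with the near-zero IO seen on sdc.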


* Re: NewStore performance analysis
  2015-04-20 14:59 NewStore performance analysis Chen, Xiaoxi
@ 2015-04-20 15:39 ` Sage Weil
  2015-04-20 15:55   ` Mark Nelson
  2015-04-20 16:11   ` Reply: " Chen, Xiaoxi
  0 siblings, 2 replies; 9+ messages in thread
From: Sage Weil @ 2015-04-20 15:39 UTC (permalink / raw)
  To: Chen, Xiaoxi
  Cc: Mark Nelson, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel


On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
> [Resend in plain text]
> 
> Hi,
>        I have played some tunable on RocksDB these days, try to optimize the performance of Newstore.   From the data now ,seems the WA of  RocksDB is not the issue that blocking the performance, and also seems not the fragment part(aio/dio, etc). The issue might be how much OPS rocksdb can offer under 1-write-per-sync workload. I cannot find the number online so I will do it by myself,  if that number is low, maybe we need holding multiple RocksDB instance in one OSD and do some sharding .
> 
> The WAL log of Rocksdb, RocksDB data file and Newstore directory were backed by 3 separate SSDs. 
> /dev/sdc1      156172796    32928 156139868   1% /root/ceph-0-db
> /dev/sdd1      195264572    32928 195231644   1% /root/ceph-0-db-wal
> /dev/sdb1      156172796 10589552 145583244   7% /var/lib/ceph/osd/ceph-0
> 
> Some interesting finds here:
> 
> 1.  Avg_reqsz in SDB(newstore FS part) is 2KB, that is half of the request block size(4KB),  IOPS in iostat(2K) is ~ 2X of the number reported by FIO. BW matched
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.00   58.33     0.00    28.98  1017.51     6.33  108.55    0.00  108.55   1.30   7.60
> sdb               0.00     0.00    0.00 2038.00     0.00     3.98     4.00     0.13    0.07    0.00    0.07   0.07  13.33
> sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdd               0.00   747.67    0.00 2099.67     0.00    11.28    11.00     0.76    0.36    0.00    0.36   0.36  75.73  
> 
> I believe newstore will not split the request, so there should be some very small IO(~0KB) goes with the data write(4KB), where the small IO comes from ?  
> 
> Also checked the Filestore data, this behavior is not present in Filestore, changing the WBThrottle will affect the number. So seems this behavior is related with the flushing mechanism? In newstore we are doing fdatasync more aggressively.

Yeah, it sounds like the difference is that newstore is doing immediate 
fdatasync's (on new objects or appends, and on applying post-commit wal 
items).  The 2k IOs are probably the xfs journal commit?

> 2. Notice that by tuning the write_buffer_size , wirte_buffer_num and 
> min_write_buffer_number_to_merge, we can make the DB write to ZERO
> 
> Look at the iostat of SDC, actually there is almost no IO happened 
> there, that is because most of the WAL entries were merged before 
> flushing to Level0.
> 
> Other RocksDB tuning are originally trying  to optimize the compaction 
> behavior, but since there is few data written to Level0, the compaction 
> is almost unmeasurable here.

This is good news.  Was the overlay code being used in this case?  (By 
default it should kick in for 4k writes unless you do 'newstore overlay 
max = 0' or similar.)  If we can confirm that our wal writes aren't being 
amplified at all that's great news.

> 3. Disable RocksDB WAL can 3X  the performance(Although this is 
> definitely WRONG WAY)
> 
> Just curious if there is no extra IO happened in DB side, what the 
> performance looks like. I turn off the WAL log of rocks DB, the 
> performance is 3x(799-2464 , lat from 10 -> 3.2)
> 
> 4. The avg queue size is <1 in any case, both DB_WAL part and fragment 
> part.
> 
> I guess there is some lock in rocksdb::WriteBatch() that preventing 
> multiple OSD_OP_THREAD working concurrently, not carefully analyzed.

I think it's just newstore, actually.  The only thing that ever triggers a 
commit/sync is the _kv_sync_thread, which calls submit_transaction_sync(), 
and it's just one thread.  On the one hand it's kind of lame to have 
this loop pushing queued transactions to disk.  On the other hand it 
serves to throttle work and provide fairness with all the other IO we 
are generating.

Again, I think the main limiting factor here though is going to be how 
rocksdb implements its WAL (as a file which requires 2 IOs per commit, one 
to write the data block(s) and one to update/journal the file size 
and/or allocation changes).
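
The single-committer behaviour described above can be illustrated with a toy model. This is a sketch only, not NewStore's code: ToyKV, kv_sync_pass and the idea that one sync covers every record written before it (a single WAL file) are assumptions made for illustration:

```python
from collections import deque


class ToyKV:
    """Toy key/value log: writes are buffered until sync() makes them durable."""

    def __init__(self):
        self.buffered = []   # written to the log, not yet synced
        self.durable = []    # covered by a completed sync
        self.sync_calls = 0

    def submit(self, txn):
        self.buffered.append(txn)

    def sync(self):
        # Assumption: one sync covers everything written before it.
        self.durable.extend(self.buffered)
        self.buffered.clear()
        self.sync_calls += 1


def kv_sync_pass(db, pending):
    """One iteration of a single sync thread: drain the queue, sync once."""
    batch = list(pending)
    pending.clear()
    for txn in batch:
        db.submit(txn)
    if batch:
        db.sync()
    return batch


db = ToyKV()
pending = deque(f"txn{i}" for i in range(8))  # e.g. QD=8 worth of writes
done = kv_sync_pass(db, pending)
print(len(done), "transactions committed with", db.sync_calls, "sync call(s)")
```

In this model only one sync is ever outstanding, which would keep the queue depth on the WAL device below 1 no matter how many op threads feed the queue.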
 
> An easy way to measure might be comment out  
> db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if 
> we can get more QD in fragment part without issuing the DB.

I'm not sure I totally understand the interface.. my assumption is that 
queue_transaction will give rocksdb the txn to commit whenever it finds it 
convenient (no idea what policy is used there) and queue_transaction_sync 
will trigger a commit now.  If we did have multiple threads doing 
queue_transaction_sync (by, say, calling it directly in _txc_submit_kv) 
would QD go up?

Thanks!
sage






* Re: NewStore performance analysis
  2015-04-20 15:39 ` Sage Weil
@ 2015-04-20 15:55   ` Mark Nelson
  2015-04-20 16:11   ` Reply: " Chen, Xiaoxi
  1 sibling, 0 replies; 9+ messages in thread
From: Mark Nelson @ 2015-04-20 15:55 UTC (permalink / raw)
  To: Sage Weil, Chen, Xiaoxi
  Cc: Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel

On 04/20/2015 10:39 AM, Sage Weil wrote:
> On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
>> [Resend in plain text]
>>
>> Hi,
>>         I have played some tunable on RocksDB these days, try to optimize the performance of Newstore.   From the data now ,seems the WA of  RocksDB is not the issue that blocking the performance, and also seems not the fragment part(aio/dio, etc). The issue might be how much OPS rocksdb can offer under 1-write-per-sync workload. I cannot find the number online so I will do it by myself,  if that number is low, maybe we need holding multiple RocksDB instance in one OSD and do some sharding .
>>
>> The WAL log of Rocksdb, RocksDB data file and Newstore directory were backed by 3 separate SSDs.
>> /dev/sdc1      156172796    32928 156139868   1% /root/ceph-0-db
>> /dev/sdd1      195264572    32928 195231644   1% /root/ceph-0-db-wal
>> /dev/sdb1      156172796 10589552 145583244   7% /var/lib/ceph/osd/ceph-0
>>
>> Some interesting finds here:
>>
>> 1.  Avg_reqsz in SDB(newstore FS part) is 2KB, that is half of the request block size(4KB),  IOPS in iostat(2K) is ~ 2X of the number reported by FIO. BW matched
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sda               0.00     0.00    0.00   58.33     0.00    28.98  1017.51     6.33  108.55    0.00  108.55   1.30   7.60
>> sdb               0.00     0.00    0.00 2038.00     0.00     3.98     4.00     0.13    0.07    0.00    0.07   0.07  13.33
>> sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>> sdd               0.00   747.67    0.00 2099.67     0.00    11.28    11.00     0.76    0.36    0.00    0.36   0.36  75.73
>>
>> I believe newstore will not split the request, so there should be some very small IO(~0KB) goes with the data write(4KB), where the small IO comes from ?
>>
>> Also checked the Filestore data, this behavior is not present in Filestore, changing the WBThrottle will affect the number. So seems this behavior is related with the flushing mechanism? In newstore we are doing fdatasync more aggressively.
>
> Yeah, it sounds like the difference is that newstore is doing immediate
> fdatasync's (on new objects or appends, and on applying post-commit wal
> items).  The 2k IOs are probably the xfs journal commit?
>
>> 2. Notice that by tuning the write_buffer_size , wirte_buffer_num and
>> min_write_buffer_number_to_merge, we can make the DB write to ZERO
>>
>> Look at the iostat of SDC, actually there is almost no IO happened
>> there, that is because most of the WAL entries were merged before
>> flushing to Level0.
>>
>> Other RocksDB tuning are originally trying  to optimize the compaction
>> behavior, but since there is few data written to Level0, the compaction
>> is almost unmeasurable here.
>
> This is good news.  Was the overlay code being used in this case?  (By
> default it should kick in for 4k writes unless you do 'newstore overlay
> max = 0' or similar.  If we can confirm that our wal writes aren't being
> amplified at all that's great news.

So I should retest, but with overlay disabled I thought I was still 
seeing writes into level 0 (and ultimately propagated to level 4) when 
testing on my SSD setup using six 512MB buffers and 
min_write_buffer_number_to_merge = 2.  I'll try poking at it some more 
using the other settings Xiaoxi tested.  The good news is that with all 
of the changes we've made, spinning disk write performance is getting 
much closer to (and sometimes beating!) filestore.  Sent some results 
along in the other thread.

>
>> 3. Disable RocksDB WAL can 3X  the performance(Although this is
>> definitely WRONG WAY)
>>
>> Just curious if there is no extra IO happened in DB side, what the
>> performance looks like. I turn off the WAL log of rocks DB, the
>> performance is 3x(799-2464 , lat from 10 -> 3.2)
>>
>> 4. The avg queue size is <1 in any case, both DB_WAL part and fragment
>> part.
>>
>> I guess there is some lock in rocksdb::WriteBatch() that preventing
>> multiple OSD_OP_THREAD working concurrently, not carefully analyzed.
>
> I think it's just newstore, actually.  The only thing that ever triggers a
> commit/sync is the _kv_sync_thread, which calls submit_transaction_sync(),
> and it's just one thread.  On the one hand it's kind of lame to have
> this loop pushing queued transactions to disk.  On the other hand it
> serves to throttle work and provide fairness with all the other IO we
> are generating.
>
> Again, I think the main limiting factor here though is going to be how
> rocksdb implements its WAL (as a file which requires 2 IOs per commit, one
> to write the data block(s) and one to update/journal the file size
> and/or allocation changes).

I'm still struck by the massive performance loss on my SSD configuration 
going from 4MB IOs to 2MB IOs.  SSD theoretical is around 1.7GB/s.  With 
4MB IOs on recent newstore we can achieve a little north of 1GB/s (ie 
better than filestore!), but as soon as we drop to 2MB IOs performance 
drops to 200MB/s while filestore stays around 600MB/s.  The partial 
object writes really hurt.

>
>> An easy way to measure might be comment out
>> db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if
>> we can get more QD in fragment part without issuing the DB.
>
> I'm not sure I totally understand the interface.. my assumption is that
> queue_transaction will give rocksdb the txn to commit whenever it finds it
> convenient (no idea what policy is used there) and queue_transaction_sync
> will trigger a commit now.  If we did have multiple threads doing
> queue_trandsaction_sync (by, say, calling it directly in _txc_submit_kv)
> would qa go up?
>
> Thanks!
> sage
>
>
>
>


* Reply: Re: NewStore performance analysis
  2015-04-20 15:39 ` Sage Weil
  2015-04-20 15:55   ` Mark Nelson
@ 2015-04-20 16:11   ` Chen, Xiaoxi
       [not found]     ` <alpine.DEB.2.00.1504200945000.18547@cobra.newdream.net>
  1 sibling, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-20 16:11 UTC (permalink / raw)
  To: Sage Weil
  Cc: Mark Nelson, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel





---- Sage Weil wrote ----

> On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
> > [Resend in plain text]
> > 
> > Hi,
> >        I have played some tunable on RocksDB these days, try to optimize the performance of Newstore.   From the data now ,seems the WA of  RocksDB is not the issue that blocking the performance, and also seems not the fragment part(aio/dio, etc). The issue might be how much OPS rocksdb can offer under 1-write-per-sync workload. I cannot find the number online so I will do it by myself,  if that number is low, maybe we need holding multiple RocksDB instance in one OSD and do some sharding .
> > 
> > The WAL log of Rocksdb, RocksDB data file and Newstore directory were backed by 3 separate SSDs. 
> > /dev/sdc1      156172796    32928 156139868   1% /root/ceph-0-db
> > /dev/sdd1      195264572    32928 195231644   1% /root/ceph-0-db-wal
> > /dev/sdb1      156172796 10589552 145583244   7% /var/lib/ceph/osd/ceph-0
> > 
> > Some interesting finds here:
> > 
> > 1.  Avg_reqsz in SDB(newstore FS part) is 2KB, that is half of the request block size(4KB),  IOPS in iostat(2K) is ~ 2X of the number reported by FIO. BW matched
> > 
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sda               0.00     0.00    0.00   58.33     0.00    28.98  1017.51     6.33  108.55    0.00  108.55   1.30   7.60
> > sdb               0.00     0.00    0.00 2038.00     0.00     3.98     4.00     0.13    0.07    0.00    0.07   0.07  13.33
> > sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > sdd               0.00   747.67    0.00 2099.67     0.00    11.28    11.00     0.76    0.36    0.00    0.36   0.36  75.73  
> > 
> > I believe newstore will not split the request, so there should be some very small IO(~0KB) goes with the data write(4KB), where the small IO comes from ?  
> > 
> > Also checked the Filestore data, this behavior is not present in Filestore, changing the WBThrottle will affect the number. So seems this behavior is related with the flushing mechanism? In newstore we are doing fdatasync more aggressively.
> 
> Yeah, it sounds like the difference is that newstore is doing immediate 
> fdatasync's (on new objects or appends, and on applying post-commit wal 
> items).  The 2k IOs are probably the xfs journal commit?

> 
> > 2. Notice that by tuning the write_buffer_size , wirte_buffer_num and 
> > min_write_buffer_number_to_merge, we can make the DB write to ZERO
> > 
> > Look at the iostat of SDC, actually there is almost no IO happened 
> > there, that is because most of the WAL entries were merged before 
> > flushing to Level0.
> > 
> > Other RocksDB tuning are originally trying  to optimize the compaction 
> > behavior, but since there is few data written to Level0, the compaction 
> > is almost unmeasurable here.
> 
> This is good news.  Was the overlay code being used in this case?  (By 
> default it should kick in for 4k writes unless you do 'newstore overlay 
> max = 0' or similar.  If we can confirm that our wal writes aren't being 
> amplified at all that's great news.
> 

No, I disabled overlay in all cases. I think that with the latest commit in wip-newstore we can cap the total amount of WAL, and with that we can calculate the write buffer size and the other tunables.

> > 3. Disable RocksDB WAL can 3X  the performance(Although this is 
> > definitely WRONG WAY)
> > 
> > Just curious if there is no extra IO happened in DB side, what the 
> > performance looks like. I turn off the WAL log of rocks DB, the 
> > performance is 3x(799-2464 , lat from 10 -> 3.2)
> > 
> > 4. The avg queue size is <1 in any case, both DB_WAL part and fragment 
> > part.
> > 
> > I guess there is some lock in rocksdb::WriteBatch() that preventing 
> > multiple OSD_OP_THREAD working concurrently, not carefully analyzed.
> 
> I think it's just newstore, actually.  The only thing that ever triggers a 
> commit/sync is the _kv_sync_thread, which calls submit_transaction_sync(), 
> and it's just one thread.  On the one hand it's kind of lame to have 
> this loop pushing queued transactions to disk.  On the other hand it 
> serves to throttle work and provide fairness with all the other IO we 
> are generating.
> 
> Again, I think the main limiting factor here though is going to be how 
> rocksdb implements its WAL (as a file which requires 2 IOs per commit, one 
> to write the data block(s) and one to update/journal the file size 
> and/or allocation changes).
>  
That is true. Does the RocksDB community plan to optimize it?
The WAL is on a separate SSD in this case, so this does not seem to be the limiting factor in this test.


> > An easy way to measure might be comment out  
> > db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if 
> > we can get more QD in fragment part without issuing the DB.
> 
> I'm not sure I totally understand the interface.. my assumption is that 
> queue_transaction will give rocksdb the txn to commit whenever it finds it 
> convenient (no idea what policy is used there) and queue_transaction_sync 
> will trigger a commit now.  If we did have multiple threads doing 
> queue_trandsaction_sync (by, say, calling it directly in _txc_submit_kv) 
> would qa go up?
> 
I think you might be missing something: currently the two interfaces are exactly the SAME unless you set rocksdb-disable-sync=true (it is false by default).

On commit, RocksDB writes the content to both the memtable (write buffer) and the WAL. If the transaction does not go with sync, it still commits immediately, but the write to the WAL is NOT synced (by calling fdatasync). That means we may lose data on power failure or kernel panic. This is why I changed the default of rocksdb-disable-sync from true to false in a previous patch.
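
The difference between committing and syncing can be shown at the file level. A minimal sketch, assuming a plain append-only file stands in for the WAL; append_record is a hypothetical helper, not a RocksDB call:

```python
import os
import tempfile

# os.fdatasync is not available on every platform; fall back to os.fsync.
_sync = getattr(os, "fdatasync", os.fsync)


def append_record(fd, payload, sync):
    """Append one record to the log; only sync=True waits for stable storage."""
    os.write(fd, payload)
    if sync:
        _sync(fd)  # blocks until the data has reached the device


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "toy.wal")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        append_record(fd, b"txn-1\n", sync=False)  # visible, not yet durable
        append_record(fd, b"txn-2\n", sync=True)   # durable once this returns
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        records = f.read().splitlines()

print(records)
```

Both records are readable immediately after the write, but a record written with sync=False sits in the page cache until something flushes it, so it may be lost on power failure.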
        Thanks
          Xiaoxi



> Thanks!
> sage
> 
> 
> 
> 


* RE: Reply: Re: NewStore performance analysis
       [not found]     ` <alpine.DEB.2.00.1504200945000.18547@cobra.newdream.net>
@ 2015-04-21  6:43       ` Chen, Xiaoxi
  2015-04-21  8:51         ` Haomai Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-21  6:43 UTC (permalink / raw)
  To: Sage Weil
  Cc: Mark Nelson, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel

Hi Sage,
    Well, they are:
        submit_transaction -- submit a transaction; whether it blocks waiting for fdatasync depends on rocksdb-disable-sync.
        submit_transaction_sync -- queue a transaction and wait until it is stable on disk.
    So if we default rocksdb-disable-sync to false, the two APIs are the same. I haven't looked at LevelDB, but I suspect it is similar.

    I just re-read the NewStore code, and the workflow does not seem to be what we want. We issue a bunch of submit_transaction calls, and in _kv_sync_thread we try to create a checkpoint that ensures the previous transactions are persistent, by using submit_transaction_sync to submit an empty transaction. But actually:
    1. submit_transaction is already a synchronous call, so the empty transaction in _kv_sync_thread is something of a waste.
    2. A sync transaction cannot ensure that the previous transactions are also synced. The API doesn't guarantee this, and in the implementation the two transactions may go to different WAL files.

    Yes, if we want, we can have a queue and a thread that collects the transactions and merges them into one big transaction; some ::fdatasync calls would be saved this way. But this approach looks complex.

    Some optimizations I have in mind:
    1. Batch the cleanup operations in _apply_wal_transaction. We don't need to remove the WAL item synchronously; we can just put the items into kv_sync_thread_Q and let kv_sync_thread form a batch transaction that deletes a bunch of keys.
    2. We don't need the empty transaction in kv_sync_thread; we could call _txc_kv_finish_kv directly from _txc_submit_kv, since the KV write is synchronous.
    3. Then we can rename _kv_sync_thread to _kv_cleanup_thread to better describe its work.
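
Optimization 1 could look roughly like the following. This is a hypothetical sketch: cleanup_queue, build_cleanup_batch and the key names are invented for illustration and do not correspond to existing NewStore code:

```python
from collections import deque


def build_cleanup_batch(cleanup_queue, max_keys=64):
    """Drain up to max_keys applied-WAL keys into one delete batch."""
    batch = []
    while cleanup_queue and len(batch) < max_keys:
        batch.append(("rm", cleanup_queue.popleft()))
    return batch


# Keys of WAL items that have already been applied (made-up names).
cleanup_queue = deque(f"wal_{i:04d}" for i in range(5))
kv = {key: b"payload" for key in cleanup_queue}

batch = build_cleanup_batch(cleanup_queue)
for op, key in batch:          # one transaction instead of five
    if op == "rm":
        kv.pop(key, None)

print(len(batch), "deletes in one batch;", len(kv), "WAL keys left")
```

The point of the sketch is only that the removals are collected and submitted together, so the op path does not have to wait on each individual delete.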

    What do you think?

Xiaoxi
-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Tuesday, April 21, 2015 12:48 AM
To: Chen, Xiaoxi
Cc: Mark Nelson; Somnath Roy; Duan, Jiangang; Zhang, Jian; ceph-devel
Subject: Re: Reply: Re: NewStore performance analysis

On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
> > > An easy way to measure might be comment out
> > > db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if
> > > we can get more QD in fragment part without issuing the DB.
> > 
> > I'm not sure I totally understand the interface.. my assumption is 
> > that queue_transaction will give rocksdb the txn to commit whenever 
> > it finds it convenient (no idea what policy is used there) and 
> > queue_transaction_sync will trigger a commit now.  If we did have 
> > multiple threads doing queue_trandsaction_sync (by, say, calling it 
> > directly in _txc_submit_kv) would qa go up?
> > 
> I think you might miss something, currently the two interface are 
> exactly the SAME unless you set rocksdb-disable-sync=true(which is 
> false by default).
> 
> When commit, rocksdb will write the content to both memtable(write
> buffer) and WAL. if the transaction doesnt go with sync, it will also 
> commit now,but the write to WAL will NOT be sync(by calling fdatasync).
> That means we may lose data if power failure/kernel panic. This is why 
> i changed the default rocksdb-disable-sync from true to false in 
> previous patch.

Yeah, I'm confused.  :)

So now 'rocksdb disable sync = false', which seems to be obviously what we want for newstore.  It's different for filestore, which is doing a syncfs checkpoint.  Perhaps we should have newstore set that explicitly instead of passing through a config option.

In any case, though, I'm confused by

> if the transaction doesn't go with sync, it will also commit now, but 
> the write to the WAL will NOT be synced (by calling fdatasync).

What does it mean to 'commit' but not call fdatasync?  What does commit mean in this case?

And, am I correct in understanding that we have

 queue_transaction -- queue a transaction but don't block waiting for fdatasync
 queue_transaction_sync -- queue transaction and wait until it is stable on disk

to work with?

Thanks!
sage


* Re: 回复: Re: NewStore performance analysis
  2015-04-21  6:43       ` Chen, Xiaoxi
@ 2015-04-21  8:51         ` Haomai Wang
       [not found]           ` <alpine.DEB.2.00.1504211246450.18547@cobra.ne <alpine.DEB.2.00.1504211627110.18547@cobra.newdream.net>
  0 siblings, 1 reply; 9+ messages in thread
From: Haomai Wang @ 2015-04-21  8:51 UTC (permalink / raw)
  To: Chen, Xiaoxi
  Cc: Sage Weil, Mark Nelson, Somnath Roy, Duan, Jiangang, Zhang, Jian,
	ceph-devel

On Tue, Apr 21, 2015 at 2:43 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> Hi Sage,
>         Well, that's
>                 submit_transaction -- submit a transaction; whether it blocks waiting for fdatasync depends on rocksdb-disable-sync.
>                 submit_transaction_sync -- queue a transaction and wait until it is stable on disk.
>         So if we default rocksdb-disable-sync to false, the two APIs are the same. I haven't looked at LevelDB, but I suspect it's similar.

Eh, I don't think they're the same. By default WriteOptions.disableWAL is
false on the ceph side, submit_transaction uses
WriteOptions.sync=false, and submit_transaction_sync uses
WriteOptions.sync=true.

If sync==false, rocksdb won't sync the log file; otherwise it will call
fsync/fdatasync to flush the log file.

Please correct me if I'm wrong. :-)
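
A minimal sketch of the distinction described above, using a plain file in place of the rocksdb WAL. TinyWal and its submit() method are hypothetical names made up for illustration, not rocksdb or Ceph API; the sync flag only mimics the role of the sync write option. Without sync the record is handed to the kernel and readable, but nothing has asked for it to reach stable storage.

```python
import os
import tempfile

# os.fdatasync is not available on every platform; fall back to os.fsync.
_datasync = getattr(os, "fdatasync", os.fsync)


class TinyWal:
    """Toy write-ahead log: append records, optionally sync each one."""

    def __init__(self, path):
        # O_APPEND so every write lands at the end of the log file.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.sync_calls = 0

    def submit(self, record, sync):
        # The write reaches the kernel page cache and is visible to readers.
        # That is roughly what "commit" means when no sync is requested.
        os.write(self.fd, record + b"\n")
        if sync:
            # Only this call asks the kernel to push the data to stable storage.
            _datasync(self.fd)
            self.sync_calls += 1

    def close(self):
        os.close(self.fd)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "wal.log")
    wal = TinyWal(path)
    wal.submit(b"txn-1", sync=False)  # analogous to submit_transaction
    wal.submit(b"txn-2", sync=True)   # analogous to submit_transaction_sync
    wal.close()
    with open(path, "rb") as f:
        records = f.read().splitlines()
```

Both records can be read back, yet only the second write was followed by a sync call, so only that call carried a durability request.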

>
>         I just re-read the NewStore code, and the workflow seems not to be what we want. We issue a bunch of submit_transaction calls, and in _kv_sync_thread we try to create a checkpoint that ensures the previous transactions are persistent by using submit_transaction_sync to submit an empty transaction.  But actually:
>         1. submit_transaction is already a synchronous call, so the empty transaction in _kv_sync_thread is kind of a waste.
>         2. A sync transaction cannot ensure that the previous transactions are also synced. The API doesn't guarantee this, and in the implementation the two transactions may go to different WAL files.
>
>         Yes, if we want, we can have a queue and a thread that collect the transactions and merge them into one big transaction; some ::fdatasync calls would be saved that way. But this approach looks complex.
>
>         Some optimizations I have in mind are:
>         1. Batch the cleanup operations in _apply_wal_transaction. We don't need to remove the WAL items synchronously; we can just put them into kv_sync_thread_Q and let kv_sync_thread form a batch transaction that deletes a bunch of keys.
>         2. We don't need the empty transaction in kv_sync_thread; we could call _txc_kv_finish_kv directly from _txc_submit_kv, since the KV submit is synchronous.
>         3. Then we can rename _kv_sync_thread to _kv_cleanup_thread to better describe its work.
>
>         What do you think?
>
>                                                                                                                         Xiaoxi
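
The queue-and-thread batching idea quoted above can be sketched roughly as follows. This is a toy group-commit loop over a plain file, not NewStore or rocksdb code; GroupCommitter and its methods are hypothetical names. Callers enqueue records, and one thread drains whatever is queued, writes it in one go, and issues a single sync for the whole batch.

```python
import os
import queue
import tempfile
import threading

# os.fdatasync is not available on every platform; fall back to os.fsync.
_datasync = getattr(os, "fdatasync", os.fsync)


class GroupCommitter:
    """Toy group commit: many submitted records share one sync call."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.pending = queue.Queue()
        self.sync_calls = 0
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def submit(self, record):
        # The returned event is set once the record has been synced.
        done = threading.Event()
        self.pending.put((record, done))
        return done

    def _run(self):
        while True:
            batch = [self.pending.get()]  # block until there is work
            while True:                   # then drain everything else queued
                try:
                    batch.append(self.pending.get_nowait())
                except queue.Empty:
                    break
            stop = any(rec is None for rec, _ in batch)
            payload = b"".join(rec + b"\n" for rec, _ in batch if rec is not None)
            if payload:
                os.write(self.fd, payload)
                _datasync(self.fd)        # one sync covers the whole batch
                self.sync_calls += 1
            for _, done in batch:
                done.set()
            if stop:
                return

    def close(self):
        # A None record is the shutdown marker for the committer thread.
        self.pending.put((None, threading.Event()))
        self.thread.join()
        os.close(self.fd)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "wal.log")
    committer = GroupCommitter(path)
    events = [committer.submit(f"txn-{i}".encode()) for i in range(100)]
    for e in events:
        e.wait()
    committer.close()
    with open(path, "rb") as f:
        lines = f.read().splitlines()
    print(f"{len(lines)} records, {committer.sync_calls} sync calls")
```

How many syncs are saved depends on how many records pile up while the previous sync is in flight, so the count varies from run to run.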



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* 回复: Re: 回复: Re: 回复: Re: NewStore performance analysis
       [not found]             ` <alpine.DEB.2.00.1504211627110.18547@cobra.newdream.net>
@ 2015-04-21 23:47               ` Chen, Xiaoxi
       [not found]                 ` <alpine.DEB.2.00.1504211654560.18547@cobra.newdream.net>
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-21 23:47 UTC (permalink / raw)
  To: Sage Weil
  Cc: Haomai Wang, Mark Nelson, Somnath Roy, Duan, Jiangang,
	Zhang, Jian, ceph-devel



---- Sage Weil wrote ----

> On Tue, 21 Apr 2015, Chen, Xiaoxi wrote:
> > Haomai is right in theory, but I am not sure whether all 
> > users (mon, filestore, kvstore) of the submit_transaction API clearly hold 
> > the expectation that their data is not persistent and may be lost on 
> > failure.  So in rocksdb now sync defaults to true even in 
> > submit_transaction (and this option makes the two APIs exactly the same). 
> > Maybe we need to rename the API to 
> > submit_transaction_persistent/nonpersistent to better describe the 
> > behavior?
> 
> Let's audit them, then.. I think they are right, but we may as well 
> confirm!
> 
> Again, FileStore is the odd one out here because it is relying on the 
> syncfs(2) at commit time for everything.
> 

Yes, so maybe we don't need to expose the option to users; we can decide whether to sync in code logic.

I remember some folks on our team tried to move the KVDB to a partition on an SSD while leaving the other filestore data on HDD; as I recall, it benefited performance.  This deployment is problematic with kv_sync=false.  I will check the data first, and then we can evaluate whether we want to support this kind of deployment.

> > And yeah, whether a successful sync (persistent) transaction also persists the 
> > previous non-sync (non-persistent) transactions is the main issue here.
> 
> Yeah. I posted to facebook, no reply yet!
> 
> s
> 
> > 
> > ---- Sage Weil wrote ----
> > 
> > 
> > On Tue, 21 Apr 2015, Haomai Wang wrote:
> > > On Tue, Apr 21, 2015 at 2:43 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> > > > Hi Sage,
> > > >         Well, that's
> > > >                 submit_transaction -- submit a transaction; whether it blocks waiting for fdatasync depends on rocksdb-disable-sync.
> > > >                 submit_transaction_sync -- queue a transaction and wait until it is stable on disk.
> > > >         So if we default rocksdb-disable-sync to false, the two APIs are the same. I haven't looked at LevelDB, but I suspect it's similar.
> > >
> > > Eh, I don't think they're the same. By default WriteOptions.disableWAL is
> > > false on the ceph side, submit_transaction uses
> > > WriteOptions.sync=false, and submit_transaction_sync uses
> > > WriteOptions.sync=true.
> > >
> > > If sync==false, rocksdb won't sync the log file; otherwise it will call
> > > fsync/fdatasync to flush the log file.
> > >
> > > Please correct me if I'm wrong. :-)
> > 
> > That's what it looks like to me, too.  I think the disable sync is a
> > separate optimization for bulk data loading that only filestore wants
> > (because it calls sync(2); we should probably set the option explicitly in
> > FileStore.cc instead of exposing as a ceph option?).
> > 
> > I think the issue with what we have now is that a
> > submit_transaction_sync() with an empty transaction may not sync previous
> > transactions if the log rolled over?  I would really expect that it would,
> > though... :/  I'll ask on the rocksdb facebook page.
> > 
> > sage
> > 
> > 
> >
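
The open question in the quoted discussion, whether a synced empty transaction also covers earlier unsynced writes once the log has rolled over, comes down to a sync call acting on one file at a time. The sketch below is a toy model of that concern only. RollingLog is a hypothetical class, its dirty map is the model's own bookkeeping rather than kernel state, and nothing here is a claim about what rocksdb actually does on log rollover.

```python
import os
import tempfile

# os.fdatasync is not available on every platform; fall back to os.fsync.
_datasync = getattr(os, "fdatasync", os.fsync)


class RollingLog:
    """Toy segmented log that tracks which segments were never synced."""

    def __init__(self, directory):
        self.directory = directory
        self.index = 0
        self.fd = self._open(self.index)
        self.dirty = {}  # segment index -> True while it has unsynced writes

    def _open(self, index):
        path = os.path.join(self.directory, f"{index:06d}.log")
        return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def append(self, record, sync):
        if record:
            os.write(self.fd, record + b"\n")
            self.dirty[self.index] = True
        if sync:
            _datasync(self.fd)  # covers the current segment only
            self.dirty[self.index] = False

    def roll(self, sync_old):
        # Switch to a new segment, optionally syncing the old one first.
        if sync_old:
            _datasync(self.fd)
            self.dirty[self.index] = False
        os.close(self.fd)
        self.index += 1
        self.fd = self._open(self.index)

    def unsynced_segments(self):
        return sorted(i for i, is_dirty in self.dirty.items() if is_dirty)

    def close(self):
        os.close(self.fd)


def checkpoint_after_roll(sync_old):
    """Write unsynced, roll the log, then submit an empty synced record."""
    with tempfile.TemporaryDirectory() as d:
        log = RollingLog(d)
        log.append(b"txn-1", sync=False)
        log.roll(sync_old=sync_old)
        log.append(b"", sync=True)  # the empty "checkpoint" transaction
        result = log.unsynced_segments()
        log.close()
        return result


naive = checkpoint_after_roll(sync_old=False)    # segment 0 was never synced
careful = checkpoint_after_roll(sync_old=True)   # rollover synced segment 0
```

In this model the empty synced record only helps if the rollover itself syncs the outgoing segment, which is exactly the guarantee the thread is trying to confirm.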


* Re: 回复: Re: 回复: Re: 回复: Re: NewStore performance analysis
       [not found]                 ` <alpine.DEB.2.00.1504211654560.18547@cobra.newdream.net>
@ 2015-04-21 23:59                   ` Mark Nelson
  2015-04-22  3:34                     ` Chen, Xiaoxi
  0 siblings, 1 reply; 9+ messages in thread
From: Mark Nelson @ 2015-04-21 23:59 UTC (permalink / raw)
  To: Sage Weil, Chen, Xiaoxi
  Cc: Haomai Wang, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel

On 04/21/2015 06:57 PM, Sage Weil wrote:
> On Tue, 21 Apr 2015, Chen, Xiaoxi wrote:
>> ---- Sage Weil wrote ----
>>
>>> On Tue, 21 Apr 2015, Chen, Xiaoxi wrote:
>>>> Haomai is right in theory, but I am not sure whether all
>>>> users (mon, filestore, kvstore) of the submit_transaction API clearly hold
>>>> the expectation that their data is not persistent and may be lost on
>>>> failure.  So in rocksdb now sync defaults to true even in
>>>> submit_transaction (and this option makes the two APIs exactly the same).
>>>> Maybe we need to rename the API to
>>>> submit_transaction_persistent/nonpersistent to better describe the
>>>> behavior?
>>>
>>> Let's audit them, then.. I think they are right, but we may as well
>>> confirm!
>>>
>>> Again, FileStore is the odd one out here because it is relying on the
>>> syncfs(2) at commit time for everything.
>>>
>>
>> Yes, so maybe we don't need to expose the option to users; we can decide
>> whether to sync in code logic.
>
> Yeah, I think it'll reduce confusion too.  I suggest we do a pull request
> against master that does this... let me know if you want to do it,
> otherwise I will!
>
>> I remember some folks on our team tried to move the KVDB to a partition on
>> an SSD while leaving the other filestore data on HDD; as I recall, it benefited
>> performance.  This deployment is problematic with kv_sync=false.  I will
>> check the data first, and then we can evaluate whether we want to support
>> this kind of deployment.
>
> We could detect this by doing a stat(2) on the current/omap/ vs current/
> dirs and checking if it's a different file system.  If so, we can do the
> syncfs(2) on both dirs.  The btrfs case would probably not be practical,
> but we can error out in that case.  But yeah not sure how important it
> would be to support this since filestore doesn't use leveldb that
> heavily... and I'd prefer to limit our investment of time there if we can
> instead make newstore (or something else) better.

FWIW, the last time I tried, putting leveldb on SSD didn't really help at 
all.  It's been a while so maybe that's changed, but newstore definitely 
seems like the way forward to me.
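
For reference, the stat(2) check suggested in the quoted text above could look roughly like this. It is only a sketch: dirs_needing_sync is a hypothetical helper, the directory layout is created in a temporary location, and the actual syncfs(2) call is left out because the Python standard library does not wrap it.

```python
import os
import tempfile


def dirs_needing_sync(data_dir, omap_dir):
    """Return the directories whose filesystems each need their own sync.

    Two paths are on the same filesystem when stat(2) reports the same
    device id (st_dev) for both.
    """
    if os.stat(data_dir).st_dev == os.stat(omap_dir).st_dev:
        return [data_dir]            # one sync covers both
    return [data_dir, omap_dir]      # omap lives elsewhere, sync both


with tempfile.TemporaryDirectory() as d:
    # Stand-ins for current/ and current/omap/ in an OSD data directory.
    current = os.path.join(d, "current")
    omap = os.path.join(current, "omap")
    os.makedirs(omap)
    targets = dirs_needing_sync(current, omap)
```

With omap as an ordinary subdirectory the device ids match and one target is returned; mounting a separate partition at current/omap would make the ids differ.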

Mark


* RE: 回复: Re: 回复: Re: 回复: Re: NewStore performance analysis
  2015-04-21 23:59                   ` Mark Nelson
@ 2015-04-22  3:34                     ` Chen, Xiaoxi
  0 siblings, 0 replies; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-22  3:34 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil
  Cc: Haomai Wang, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel,
	Xue, Chendi

Hi Sage and Mark,

           Chendi from our team ran the test on v0.91. The setup is 4 nodes with 40 HDDs in total, SSDs as journals, and replica=2.

           Mounting a partition from the journal SSD at /current/omap improved peak 4K random write IOPS from 1524 to 2694, a gain of about 76%, while other IO patterns stayed the same.

           If this can be reproduced in other setups, I suspect it is worth investing some time in the detection.

Some details are below.

                        Prev         Omap2ssd
Runid                   305          320
OP_SIZE                 4k           4k
OP_TYPE                 randwrite    randwrite
QD                      qd8          qd8
Engine                  vdb          vdb
server_num              4            4
client_num              2            2
rbd_num                 40           40
RBD_FIO_IOPS            1524         2694
RBD_FIO_BW              6170.1       10864.23
RBD_FIO_Latency         209.3851     119.4587
osd_read_iops           7.862196     322.4334
osd_write_iops          7677.648     10930
osd_read_bw             0.446566     1.409266
(no header in source)   54.916435    71.33833
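
The relative changes between the two runs can be checked with a few lines of arithmetic. The numbers are copied from the table above; percent_change is just a helper name for this sketch, and no units are assumed for the bandwidth or latency columns.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


iops_change = percent_change(1524, 2694)             # RBD_FIO_IOPS
bw_change = percent_change(6170.1, 10864.23)         # RBD_FIO_BW
latency_change = percent_change(209.3851, 119.4587)  # RBD_FIO_Latency

print(f"IOPS {iops_change:+.1f}%  BW {bw_change:+.1f}%  latency {latency_change:+.1f}%")
```

IOPS and bandwidth both rise by roughly 76 to 77 percent, and the reported latency falls by about 43 percent.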

															Xiaoxi


