* NewStore performance analysis
@ 2015-04-20 14:59 Chen, Xiaoxi
2015-04-20 15:39 ` Sage Weil
0 siblings, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-20 14:59 UTC (permalink / raw)
To: Sage Weil, Mark Nelson, Somnath Roy, Chen, Xiaoxi
Cc: Duan, Jiangang, Zhang, Jian, ceph-devel
[Resend in plain text]
Hi,
I have been experimenting with some RocksDB tunables over the past few days to optimize NewStore performance. From the data so far, the write amplification (WA) of RocksDB does not seem to be what is limiting performance, and neither does the fragment part (aio/dio, etc.). The issue might be how many operations per second RocksDB can deliver under a one-write-per-sync workload. I could not find that number online, so I will measure it myself. If it turns out to be low, we may need to hold multiple RocksDB instances in one OSD and do some sharding.
The RocksDB WAL, the RocksDB data files, and the NewStore directory were backed by 3 separate SSDs.
/dev/sdc1 156172796 32928 156139868 1% /root/ceph-0-db
/dev/sdd1 195264572 32928 195231644 1% /root/ceph-0-db-wal
/dev/sdb1 156172796 10589552 145583244 7% /var/lib/ceph/osd/ceph-0
Some interesting findings:
1. Avg_reqsz on sdb (the NewStore FS part) is 2KB, which is half of the request block size (4KB). The IOPS in iostat (~2K) is ~2X the number reported by FIO. The bandwidth matches.
Device:  rrqm/s  wrqm/s   r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
sda        0.00    0.00  0.00    58.33   0.00  28.98   1017.51      6.33  108.55     0.00   108.55   1.30   7.60
sdb        0.00    0.00  0.00  2038.00   0.00   3.98      4.00      0.13    0.07     0.00     0.07   0.07  13.33
sdc        0.00    0.00  0.00     0.00   0.00   0.00      0.00      0.00    0.00     0.00     0.00   0.00   0.00
sdd        0.00  747.67  0.00  2099.67   0.00  11.28     11.00      0.76    0.36     0.00     0.36   0.36  75.73
I believe NewStore does not split the request, so there must be some very small IO (~0KB) issued along with each 4KB data write. Where does the small IO come from?
I also checked the FileStore data: this behavior is not present in FileStore, and changing the WBThrottle settings affects the number. So this behavior seems to be related to the flushing mechanism. In NewStore we call fdatasync more aggressively.
2. By tuning write_buffer_size, write_buffer_num and min_write_buffer_number_to_merge, we can bring the DB writes down to ZERO.
Look at the iostat output for sdc: there is almost no IO there, because most of the WAL entries were merged before being flushed to Level0.
The other RocksDB tunings were originally meant to optimize compaction behavior, but since very little data is written to Level0, compaction is almost unmeasurable here.
3. Disabling the RocksDB WAL can 3X the performance (although this is definitely the WRONG WAY).
I was just curious what the performance would look like with no extra IO on the DB side.
With the RocksDB WAL turned off, performance is 3x (799 -> 2464, lat from 10 -> 3.2).
4. The average queue size is <1 in every case, for both the DB WAL part and the fragment part.
I guess some lock in rocksdb::WriteBatch() is preventing multiple OSD_OP_THREADs from working concurrently, but I have not analyzed this carefully.
An easy way to measure this might be to comment out db->submit_transaction(txc->t); in NewStore::_txc_submit_kv and see whether we get more QD in the fragment part without issuing to the DB.
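As a rough cross-check of findings 3 and 4, Little's law (queue depth = throughput x latency) can be applied to the numbers above. This is only a sketch: it assumes 799 and 2464 are IOPS, that the latencies 10 and 3.2 are in milliseconds, and that await in iostat is in milliseconds. The function names are made up for illustration.

```python
def latency_ms(queue_depth, iops):
    """Mean latency implied by Little's law for a fixed queue depth."""
    return queue_depth / iops * 1000.0


def queue_depth(iops, await_ms):
    """Mean number of requests in flight implied by Little's law."""
    return iops * await_ms / 1000.0


FIO_QD = 8  # iodepth from the fio configuration

# Finding 3: client-side latency with and without the RocksDB WAL.
lat_with_wal = latency_ms(FIO_QD, 799)
lat_without_wal = latency_ms(FIO_QD, 2464)

# Finding 4: device-side queue depth from the w/s and w_await columns.
sdb_qd = queue_depth(2038.00, 0.07)   # NewStore fragment device
sdd_qd = queue_depth(2099.67, 0.36)   # RocksDB WAL device

print(f"latency with WAL:    {lat_with_wal:.2f} ms")
print(f"latency without WAL: {lat_without_wal:.2f} ms")
print(f"sdb queue depth:     {sdb_qd:.2f}")
print(f"sdd queue depth:     {sdd_qd:.2f}")
```

If the assumptions hold, the computed values line up with the reported latencies and with the avgqu-sz column, which suggests the client keeps 8 requests in flight while each device sees fewer than one at a time.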
----------------------------------------------------------Configurations---------------------------------------------------------------------------------------------------------
My setup is SSD based: 1 OSD and a pool with 100 PGs and size = 1. The workload is 4KB random write (QD=8) on top of RBD (using fio-librbd). The FIO configuration is:
bs=4k
iodepth=8
size=10g
iodepth_batch_submit=1
iodepth_batch_complete=1
The tunings I am using are listed here. They might not be the best, but they already show something.
rocksdb_stats_dump_period_sec = 5
rocksdb_max_background_compactions = 4
rocksdb_compaction_threads = 4
rocksdb_write_buffer_size = 536870912 //512MB
rocksdb_write_buffer_num = 4
rocksdb_min_write_buffer_number_to_merge = 2
rocksdb_level0_file_num_compaction_trigger = 4
rocksdb_max_bytes_for_level_base = 104857600 //100MB
rocksdb_target_file_size_base = 10485760 //10MB
rocksdb_num_levels = 3 // So the MAX_DB_SIZE would be ~10GB(100MB* 10^3), fair enough.
rocksdb_compression = none
Xiaoxi
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: NewStore performance analysis
2015-04-20 14:59 NewStore performance analysis Chen, Xiaoxi
@ 2015-04-20 15:39 ` Sage Weil
2015-04-20 15:55 ` Mark Nelson
2015-04-20 16:11 ` Reply: " Chen, Xiaoxi
0 siblings, 2 replies; 9+ messages in thread
From: Sage Weil @ 2015-04-20 15:39 UTC (permalink / raw)
To: Chen, Xiaoxi
Cc: Mark Nelson, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel
On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
> [Resend in plain text]
>
> Hi,
> I have played some tunable on RocksDB these days, try to optimize the performance of Newstore. From the data now ,seems the WA of RocksDB is not the issue that blocking the performance, and also seems not the fragment part(aio/dio, etc). The issue might be how much OPS rocksdb can offer under 1-write-per-sync workload. I cannot find the number online so I will do it by myself, if that number is low, maybe we need holding multiple RocksDB instance in one OSD and do some sharding .
>
> The WAL log of Rocksdb, RocksDB data file and Newstore directory were backed by 3 separate SSDs.
> /dev/sdc1 156172796 32928 156139868 1% /root/ceph-0-db
> /dev/sdd1 195264572 32928 195231644 1% /root/ceph-0-db-wal
> /dev/sdb1 156172796 10589552 145583244 7% /var/lib/ceph/osd/ceph-0
>
> Some interesting finds here:
>
> 1. Avg_reqsz in SDB(newstore FS part) is 2KB, that is half of the request block size(4KB), IOPS in iostat(2K) is ~ 2X of the number reported by FIO. BW matched
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 0.00 58.33 0.00 28.98 1017.51 6.33 108.55 0.00 108.55 1.30 7.60
> sdb 0.00 0.00 0.00 2038.00 0.00 3.98 4.00 0.13 0.07 0.00 0.07 0.07 13.33
> sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdd 0.00 747.67 0.00 2099.67 0.00 11.28 11.00 0.76 0.36 0.00 0.36 0.36 75.73
>
> I believe newstore will not split the request, so there should be some very small IO(~0KB) goes with the data write(4KB), where the small IO comes from ?
>
> Also checked the Filestore data, this behavior is not present in Filestore, changing the WBThrottle will affect the number. So seems this behavior is related with the flushing mechanism? In newstore we are doing fdatasync more aggressively.
Yeah, it sounds like the difference is that newstore is doing immediate
fdatasync's (on new objects or appends, and on applying post-commit wal
items). The 2k IOs are probably the xfs journal commit?
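For reference, the 2KB average can be re-derived from the iostat row itself. The sketch below assumes avgrq-sz is in 512-byte sectors and that the MB/s columns are MiB/s; the helper names are hypothetical. It also computes the size of the "companion" IO under the hypothesis that each 4KB data write is paired with exactly one extra write.

```python
SECTOR_BYTES = 512  # iostat reports avgrq-sz in 512-byte sectors


def avg_write_kib(writes_per_s, write_mib_per_s):
    """Average write size in KiB, from the w/s and wMB/s columns."""
    return write_mib_per_s * 1024.0 / writes_per_s


def sectors_to_kib(avgrq_sz_sectors):
    """Convert the avgrq-sz column to KiB."""
    return avgrq_sz_sectors * SECTOR_BYTES / 1024.0


def implied_companion_kib(avg_kib, data_kib=4.0):
    """Size of the second IO if every data write is paired with one other
    write: avg = (data + companion) / 2."""
    return 2.0 * avg_kib - data_kib


sdb_avg_kib = avg_write_kib(2038.00, 3.98)    # NewStore fragment device
sdd_avg_kib = avg_write_kib(2099.67, 11.28)   # RocksDB WAL device
companion_kib = implied_companion_kib(sdb_avg_kib)

print(f"sdb: {sdb_avg_kib:.2f} KiB from rates, {sectors_to_kib(4.00):.2f} KiB from avgrq-sz")
print(f"sdd: {sdd_avg_kib:.2f} KiB from rates, {sectors_to_kib(11.00):.2f} KiB from avgrq-sz")
print(f"implied companion IO on sdb: {companion_kib:.2f} KiB")
```

Note that the average alone cannot distinguish "every IO is 2KB" from "a 4KB write plus a near-empty one"; a block-level trace would be needed to tell the two apart.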
> 2. Notice that by tuning the write_buffer_size , wirte_buffer_num and
> min_write_buffer_number_to_merge, we can make the DB write to ZERO
>
> Look at the iostat of SDC, actually there is almost no IO happened
> there, that is because most of the WAL entries were merged before
> flushing to Level0.
>
> Other RocksDB tuning are originally trying to optimize the compaction
> behavior, but since there is few data written to Level0, the compaction
> is almost unmeasurable here.
This is good news. Was the overlay code being used in this case? (By
default it should kick in for 4k writes unless you do 'newstore overlay
max = 0' or similar.) If we can confirm that our wal writes aren't being
amplified at all that's great news.
> 3. Disable RocksDB WAL can 3X the performance(Although this is
> definitely WRONG WAY)
>
> Just curious if there is no extra IO happened in DB side, what the
> performance looks like. I turn off the WAL log of rocks DB, the
> performance is 3x(799-2464 , lat from 10 -> 3.2)
>
> 4. The avg queue size is <1 in any case, both DB_WAL part and fragment
> part.
>
> I guess there is some lock in rocksdb::WriteBatch() that preventing
> multiple OSD_OP_THREAD working concurrently, not carefully analyzed.
I think it's just newstore, actually. The only thing that ever triggers a
commit/sync is the _kv_sync_thread, which calls submit_transaction_sync(),
and it's just one thread. On the one hand it's kind of lame to have
this loop pushing queued transactions to disk. On the other hand it
serves to throttle work and provide fairness with all the other IO we
are generating.
Again, I think the main limiting factor here though is going to be how
rocksdb implements its WAL (as a file which requires 2 IOs per commit, one
to write the data block(s) and one to update/journal the file size
and/or allocation changes).
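As a back-of-the-envelope sketch of why this matters: if a single thread issues commits one after another and each commit costs a fixed number of serial IOs, the commit rate is bounded by the per-IO latency. The model below is an assumption, not a measurement. It takes w_await on the WAL device (0.36, assumed to be ms) as the per-IO latency and ignores any batching of several transactions into one commit.

```python
def serialized_commit_ceiling(ios_per_commit, io_latency_ms):
    """Upper bound on commits/s when one thread issues commits back to back
    and each commit needs `ios_per_commit` serial IOs."""
    return 1000.0 / (ios_per_commit * io_latency_ms)


WAL_IO_LATENCY_MS = 0.36  # w_await of sdd in the iostat output above

two_io_ceiling = serialized_commit_ceiling(2, WAL_IO_LATENCY_MS)
one_io_ceiling = serialized_commit_ceiling(1, WAL_IO_LATENCY_MS)

print(f"2 IOs per commit: ~{two_io_ceiling:.0f} commits/s")
print(f"1 IO per commit:  ~{one_io_ceiling:.0f} commits/s")
```

Under this simple model, halving the IOs per commit doubles the ceiling, so avoiding the extra size/allocation update would be worth a lot for a sync-per-write workload.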
> An easy way to measure might be comment out
> db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if
> we can get more QD in fragment part without issuing the DB.
I'm not sure I totally understand the interface.. my assumption is that
queue_transaction will give rocksdb the txn to commit whenever it finds it
convenient (no idea what policy is used there) and queue_transaction_sync
will trigger a commit now. If we did have multiple threads doing
queue_transaction_sync (by, say, calling it directly in _txc_submit_kv)
would the queue depth go up?
Thanks!
sage
>
>
> ----------------------------------------------------------Configurations---------------------------------------------------------------------------------------------------------
> My setup is SSD based, 1 OSD, pool with 100pg and size =1. The pattern I am working on is 4KB random write(QD=8) on top of RBD(using fio-librbd).FIO configuration is:
> bs=4k
> iodepth=8
> size=10g
> iodepth_batch_submit=1
> iodepth_batch_complete=1
>
> The tuning I am using are listed here, this might not be the best but already showing something.
> rocksdb_stats_dump_period_sec = 5
> rocksdb_max_background_compactions = 4
> rocksdb_compaction_threads = 4
> rocksdb_write_buffer_size = 536870912 //512MB
> rocksdb_write_buffer_num = 4
> rocksdb_min_write_buffer_number_to_merge = 2
> rocksdb_level0_file_num_compaction_trigger = 4
> rocksdb_max_bytes_for_level_base = 104857600 //100MB
> rocksdb_target_file_size_base = 10485760 //10MB
> rocksdb_num_levels = 3 // So the MAX_DB_SIZE would be ~10GB(100MB* 10^3), fair enough.
> rocksdb_compression = none
>
>
> Xiaoxi
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: NewStore performance analysis
2015-04-20 15:39 ` Sage Weil
@ 2015-04-20 15:55 ` Mark Nelson
2015-04-20 16:11 ` Reply: " Chen, Xiaoxi
1 sibling, 0 replies; 9+ messages in thread
From: Mark Nelson @ 2015-04-20 15:55 UTC (permalink / raw)
To: Sage Weil, Chen, Xiaoxi
Cc: Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel
On 04/20/2015 10:39 AM, Sage Weil wrote:
> On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
>> [Resend in plain text]
>>
>> Hi,
>> I have played some tunable on RocksDB these days, try to optimize the performance of Newstore. From the data now ,seems the WA of RocksDB is not the issue that blocking the performance, and also seems not the fragment part(aio/dio, etc). The issue might be how much OPS rocksdb can offer under 1-write-per-sync workload. I cannot find the number online so I will do it by myself, if that number is low, maybe we need holding multiple RocksDB instance in one OSD and do some sharding .
>>
>> The WAL log of Rocksdb, RocksDB data file and Newstore directory were backed by 3 separate SSDs.
>> /dev/sdc1 156172796 32928 156139868 1% /root/ceph-0-db
>> /dev/sdd1 195264572 32928 195231644 1% /root/ceph-0-db-wal
>> /dev/sdb1 156172796 10589552 145583244 7% /var/lib/ceph/osd/ceph-0
>>
>> Some interesting finds here:
>>
>> 1. Avg_reqsz in SDB(newstore FS part) is 2KB, that is half of the request block size(4KB), IOPS in iostat(2K) is ~ 2X of the number reported by FIO. BW matched
>>
>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sda 0.00 0.00 0.00 58.33 0.00 28.98 1017.51 6.33 108.55 0.00 108.55 1.30 7.60
>> sdb 0.00 0.00 0.00 2038.00 0.00 3.98 4.00 0.13 0.07 0.00 0.07 0.07 13.33
>> sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>> sdd 0.00 747.67 0.00 2099.67 0.00 11.28 11.00 0.76 0.36 0.00 0.36 0.36 75.73
>>
>> I believe newstore will not split the request, so there should be some very small IO(~0KB) goes with the data write(4KB), where the small IO comes from ?
>>
>> Also checked the Filestore data, this behavior is not present in Filestore, changing the WBThrottle will affect the number. So seems this behavior is related with the flushing mechanism? In newstore we are doing fdatasync more aggressively.
>
> Yeah, it sounds like the difference is that newstore is doing immediate
> fdatasync's (on new objects or appends, and on applying post-commit wal
> items). The 2k IOs are probably the xfs journal commit?
>
>> 2. Notice that by tuning the write_buffer_size , wirte_buffer_num and
>> min_write_buffer_number_to_merge, we can make the DB write to ZERO
>>
>> Look at the iostat of SDC, actually there is almost no IO happened
>> there, that is because most of the WAL entries were merged before
>> flushing to Level0.
>>
>> Other RocksDB tuning are originally trying to optimize the compaction
>> behavior, but since there is few data written to Level0, the compaction
>> is almost unmeasurable here.
>
> This is good news. Was the overlay code being used in this case? (By
> default it should kick in for 4k writes unless you do 'newstore overlay
> max = 0' or similar. If we can confirm that our wal writes aren't being
> amplified at all that's great news.
So I should retest, but with overlay disabled I thought I was still
seeing writes into level 0 (and ultimately propagated to level 4) when
testing with six 512MB buffers and
min_write_buffer_number_to_merge = 2 on my SSD setup. I'll try poking
at it some more using the other settings Xiaoxi tested. The good news
is that with all of the changes we've made, spinning disk write
performance is getting much closer to (and sometimes beating!)
filestore. Sent some results along in the other thread.
>
>> 3. Disable RocksDB WAL can 3X the performance(Although this is
>> definitely WRONG WAY)
>>
>> Just curious if there is no extra IO happened in DB side, what the
>> performance looks like. I turn off the WAL log of rocks DB, the
>> performance is 3x(799-2464 , lat from 10 -> 3.2)
>>
>> 4. The avg queue size is <1 in any case, both DB_WAL part and fragment
>> part.
>>
>> I guess there is some lock in rocksdb::WriteBatch() that preventing
>> multiple OSD_OP_THREAD working concurrently, not carefully analyzed.
>
> I think it's just newstore, actually. The only thing that ever triggers a
> commit/sync is the _kv_sync_thread, which calls submit_transaction_sync(),
> and it's just one thread. On the one hand it's kind of lame to have
> this loop pushing queued transactions to disk. On the other hand it
> serves to throttle work and provide fairness with all the other IO we
> are generating.
>
> Again, I think the main limiting factor here though is going to be how
> rocksdb implements its WAL (as a file which requires 2 IOs per commit, one
> to write the data block(s) and one to update/journal the file size
> and/or allocation changes).
I'm still struck by the massive performance loss on my SSD configuration
going from 4MB IOs to 2MB IOs. SSD theoretical is around 1.7GB/s. With
4MB IOs on recent newstore we can achieve a little north of 1GB/s (ie
better than filestore!), but as soon as we drop to 2MB IOs performance
drops to 200MB/s while filestore stays around 600MB/s. The partial
object writes really hurt.
>
>> An easy way to measure might be comment out
>> db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if
>> we can get more QD in fragment part without issuing the DB.
>
> I'm not sure I totally understand the interface.. my assumption is that
> queue_transaction will give rocksdb the txn to commit whenever it finds it
> convenient (no idea what policy is used there) and queue_transaction_sync
> will trigger a commit now. If we did have multiple threads doing
> queue_trandsaction_sync (by, say, calling it directly in _txc_submit_kv)
> would qa go up?
>
> Thanks!
> sage
>
>
>
>
>>
>>
>> ----------------------------------------------------------Configurations---------------------------------------------------------------------------------------------------------
>> My setup is SSD based, 1 OSD, pool with 100pg and size =1. The pattern I am working on is 4KB random write(QD=8) on top of RBD(using fio-librbd).FIO configuration is:
>> bs=4k
>> iodepth=8
>> size=10g
>> iodepth_batch_submit=1
>> iodepth_batch_complete=1
>>
>> The tuning I am using are listed here, this might not be the best but already showing something.
>> rocksdb_stats_dump_period_sec = 5
>> rocksdb_max_background_compactions = 4
>> rocksdb_compaction_threads = 4
>> rocksdb_write_buffer_size = 536870912 //512MB
>> rocksdb_write_buffer_num = 4
>> rocksdb_min_write_buffer_number_to_merge = 2
>> rocksdb_level0_file_num_compaction_trigger = 4
>> rocksdb_max_bytes_for_level_base = 104857600 //100MB
>> rocksdb_target_file_size_base = 10485760 //10MB
>> rocksdb_num_levels = 3 // So the MAX_DB_SIZE would be ~10GB(100MB* 10^3), fair enough.
>> rocksdb_compression = none
>>
>>
>> Xiaoxi
>>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Reply: Re: NewStore performance analysis
2015-04-20 15:39 ` Sage Weil
2015-04-20 15:55 ` Mark Nelson
@ 2015-04-20 16:11 ` Chen, Xiaoxi
[not found] ` <alpine.DEB.2.00.1504200945000.18547@cobra.newdream.net>
1 sibling, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-20 16:11 UTC (permalink / raw)
To: Sage Weil
Cc: Mark Nelson, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel
---- Sage Weil wrote ----
> On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
> > [Resend in plain text]
> >
> > Hi,
> > I have played some tunable on RocksDB these days, try to optimize the performance of Newstore. From the data now ,seems the WA of RocksDB is not the issue that blocking the performance, and also seems not the fragment part(aio/dio, etc). The issue might be how much OPS rocksdb can offer under 1-write-per-sync workload. I cannot find the number online so I will do it by myself, if that number is low, maybe we need holding multiple RocksDB instance in one OSD and do some sharding .
> >
> > The WAL log of Rocksdb, RocksDB data file and Newstore directory were backed by 3 separate SSDs.
> > /dev/sdc1 156172796 32928 156139868 1% /root/ceph-0-db
> > /dev/sdd1 195264572 32928 195231644 1% /root/ceph-0-db-wal
> > /dev/sdb1 156172796 10589552 145583244 7% /var/lib/ceph/osd/ceph-0
> >
> > Some interesting finds here:
> >
> > 1. Avg_reqsz in SDB(newstore FS part) is 2KB, that is half of the request block size(4KB), IOPS in iostat(2K) is ~ 2X of the number reported by FIO. BW matched
> >
> > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sda 0.00 0.00 0.00 58.33 0.00 28.98 1017.51 6.33 108.55 0.00 108.55 1.30 7.60
> > sdb 0.00 0.00 0.00 2038.00 0.00 3.98 4.00 0.13 0.07 0.00 0.07 0.07 13.33
> > sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> > sdd 0.00 747.67 0.00 2099.67 0.00 11.28 11.00 0.76 0.36 0.00 0.36 0.36 75.73
> >
> > I believe newstore will not split the request, so there should be some very small IO(~0KB) goes with the data write(4KB), where the small IO comes from ?
> >
> > Also checked the Filestore data, this behavior is not present in Filestore, changing the WBThrottle will affect the number. So seems this behavior is related with the flushing mechanism? In newstore we are doing fdatasync more aggressively.
>
> Yeah, it sounds like the difference is that newstore is doing immediate
> fdatasync's (on new objects or appends, and on applying post-commit wal
> items). The 2k IOs are probably the xfs journal commit?
>
> > 2. Notice that by tuning the write_buffer_size , wirte_buffer_num and
> > min_write_buffer_number_to_merge, we can make the DB write to ZERO
> >
> > Look at the iostat of SDC, actually there is almost no IO happened
> > there, that is because most of the WAL entries were merged before
> > flushing to Level0.
> >
> > Other RocksDB tuning are originally trying to optimize the compaction
> > behavior, but since there is few data written to Level0, the compaction
> > is almost unmeasurable here.
>
> This is good news. Was the overlay code being used in this case? (By
> default it should kick in for 4k writes unless you do 'newstore overlay
> max = 0' or similar. If we can confirm that our wal writes aren't being
> amplified at all that's great news.
>
No, I disabled overlay in all cases. I think with the latest commit in wip-newstore we can cap the total amount of WAL; with that, I think we can calculate the write buffer size and other tunables.
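A minimal sketch of that calculation, using the values from my configuration. It assumes rocksdb_write_buffer_num maps to RocksDB's maximum number of memtables, and that a flush needs at least min_write_buffer_number_to_merge full memtables as input. The WAL cap value and the helper names are hypothetical, chosen only to illustrate the check.

```python
MIB = 1024 * 1024

write_buffer_size = 536870912          # rocksdb_write_buffer_size (512MB)
write_buffer_num = 4                   # rocksdb_write_buffer_num
min_write_buffer_number_to_merge = 2   # rocksdb_min_write_buffer_number_to_merge


def max_memtable_bytes(buf_size, buf_num):
    """Upper bound on memory held in memtables."""
    return buf_size * buf_num


def min_flush_input_bytes(buf_size, min_merge):
    """Amount of buffered data needed before a flush to Level0 can start."""
    return buf_size * min_merge


def wal_fits_before_flush(wal_cap_bytes, buf_size, min_merge):
    """True if all outstanding WAL items fit in memtables below the flush
    threshold, so put/delete pairs can cancel out before reaching Level0."""
    return wal_cap_bytes <= min_flush_input_bytes(buf_size, min_merge)


hypothetical_wal_cap = 256 * MIB  # illustrative value, not a real default

memtable_budget = max_memtable_bytes(write_buffer_size, write_buffer_num)
flush_threshold = min_flush_input_bytes(write_buffer_size,
                                        min_write_buffer_number_to_merge)
fits = wal_fits_before_flush(hypothetical_wal_cap, write_buffer_size,
                             min_write_buffer_number_to_merge)

print(f"memtable budget: {memtable_budget // MIB} MiB")
print(f"flush threshold: {flush_threshold // MIB} MiB")
print(f"WAL cap fits before flush: {fits}")
```

This is a simplification: it ignores the non-WAL metadata keys that share the same memtables, so the real buffer would need some headroom on top of the WAL cap.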
> > 3. Disable RocksDB WAL can 3X the performance(Although this is
> > definitely WRONG WAY)
> >
> > Just curious if there is no extra IO happened in DB side, what the
> > performance looks like. I turn off the WAL log of rocks DB, the
> > performance is 3x(799-2464 , lat from 10 -> 3.2)
> >
> > 4. The avg queue size is <1 in any case, both DB_WAL part and fragment
> > part.
> >
> > I guess there is some lock in rocksdb::WriteBatch() that preventing
> > multiple OSD_OP_THREAD working concurrently, not carefully analyzed.
>
> I think it's just newstore, actually. The only thing that ever triggers a
> commit/sync is the _kv_sync_thread, which calls submit_transaction_sync(),
> and it's just one thread. On the one hand it's kind of lame to have
> this loop pushing queued transactions to disk. On the other hand it
> serves to throttle work and provide fairness with all the other IO we
> are generating.
>
> Again, I think the main limiting factor here though is going to be how
> rocksdb implements its WAL (as a file which requires 2 IOs per commit, one
> to write the data block(s) and one to update/journal the file size
> and/or allocation changes).
>
That is true. Does the RocksDB community plan to optimize it?
The WAL is on a separate SSD in this case, so it does not seem to be the limiting factor in this test.
> > An easy way to measure might be comment out
> > db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if
> > we can get more QD in fragment part without issuing the DB.
>
> I'm not sure I totally understand the interface.. my assumption is that
> queue_transaction will give rocksdb the txn to commit whenever it finds it
> convenient (no idea what policy is used there) and queue_transaction_sync
> will trigger a commit now. If we did have multiple threads doing
> queue_trandsaction_sync (by, say, calling it directly in _txc_submit_kv)
> would qa go up?
>
I think you might be missing something: currently the two interfaces are exactly the SAME unless you set rocksdb-disable-sync=true (it is false by default).
On commit, RocksDB writes the content to both the memtable (write buffer) and the WAL. If the transaction is not submitted with sync, it is still committed immediately, but the write to the WAL will NOT be synced (by calling fdatasync). That means we may lose data on power failure or kernel panic. This is why I changed the default of rocksdb-disable-sync from true to false in a previous patch.
Thanks
Xiaoxi
> Thanks!
> sage
>
>
>
>
> >
> >
> > ----------------------------------------------------------Configurations---------------------------------------------------------------------------------------------------------
> > My setup is SSD based, 1 OSD, pool with 100pg and size =1. The pattern I am working on is 4KB random write(QD=8) on top of RBD(using fio-librbd).FIO configuration is:
> > bs=4k
> > iodepth=8
> > size=10g
> > iodepth_batch_submit=1
> > iodepth_batch_complete=1
> >
> > The tuning I am using are listed here, this might not be the best but already showing something.
> > rocksdb_stats_dump_period_sec = 5
> > rocksdb_max_background_compactions = 4
> > rocksdb_compaction_threads = 4
> > rocksdb_write_buffer_size = 536870912 //512MB
> > rocksdb_write_buffer_num = 4
> > rocksdb_min_write_buffer_number_to_merge = 2
> > rocksdb_level0_file_num_compaction_trigger = 4
> > rocksdb_max_bytes_for_level_base = 104857600 //100MB
> > rocksdb_target_file_size_base = 10485760 //10MB
> > rocksdb_num_levels = 3 // So the MAX_DB_SIZE would be ~10GB(100MB* 10^3), fair enough.
> > rocksdb_compression = none
> >
> >
> > Xiaoxi
> >
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: Reply: Re: NewStore performance analysis
[not found] ` <alpine.DEB.2.00.1504200945000.18547@cobra.newdream.net>
@ 2015-04-21 6:43 ` Chen, Xiaoxi
2015-04-21 8:51 ` Haomai Wang
0 siblings, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-21 6:43 UTC (permalink / raw)
To: Sage Weil
Cc: Mark Nelson, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel
Hi Sage,
Well, it is:
submit_transaction -- submit a transaction; whether it blocks waiting for fdatasync depends on rocksdb-disable-sync.
submit_transaction_sync -- queue a transaction and wait until it is stable on disk.
So if we default rocksdb-disable-sync to false, the two APIs are the same. I haven't looked at LevelDB, but I suspect it's similar.
I just re-read the NewStore code, and the workflow does not seem to be what we want. We issue a bunch of submit_transaction calls, and in _kv_sync_thread we try to create a checkpoint that ensures the previous transactions are persistent, by using submit_transaction_sync to submit an empty transaction. But actually:
1. submit_transaction is already a synchronous call, so the empty transaction in _kv_sync_thread is kind of a waste.
2. A sync transaction cannot ensure that the previous transactions are also synced. The API doesn't guarantee this, and in the implementation the two transactions may go to different WAL files.
Yes, if we want, we can have a queue and a thread that collect the transactions and merge them into one big transaction; some ::fdatasync calls would be saved that way. But this approach looks complex.
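To make the queue-and-merge idea concrete, here is a toy sketch of it. It is not NewStore or RocksDB code: GroupCommitter and its methods are hypothetical names, transactions are plain lists of operations, and the sync function is just a callback that stands in for one synced write.

```python
class GroupCommitter:
    """Collects queued transactions and commits them as one merged batch,
    so N transactions cost a single sync instead of N."""

    def __init__(self, sync_fn):
        self._pending = []       # transactions waiting for the next flush
        self._sync_fn = sync_fn  # stands in for one synced write
        self.sync_calls = 0

    def submit(self, ops):
        """Queue a transaction (a list of operations) without syncing."""
        self._pending.append(list(ops))

    def flush(self):
        """Merge everything queued so far and sync once."""
        if not self._pending:
            return []
        merged = [op for txn in self._pending for op in txn]
        self._pending = []
        self._sync_fn(merged)
        self.sync_calls += 1
        return merged


synced_batches = []
committer = GroupCommitter(synced_batches.append)
committer.submit(["put a", "put b"])
committer.submit(["rm wal_1"])
committer.submit(["put c"])
merged = committer.flush()

print(f"{len(merged)} ops committed with {committer.sync_calls} sync")
```

In a real implementation flush() would run in its own thread and callers would need to be notified when their transaction is stable, which is where most of the complexity comes from.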
Some optimizations I have in mind:
1. Batch the cleanup operations in _apply_wal_transaction. We don't need to remove the WAL items synchronously; we can just put them into kv_sync_thread_Q and let kv_sync_thread form a batch transaction that deletes a bunch of keys.
2. We don't need the empty transaction in kv_sync_thread; we could call _txc_kv_finish_kv directly from _txc_submit_kv, since the KV write is synchronous.
3. Then we can rename _kv_sync_thread to _kv_cleanup_thread to better describe its work.
What do you think?
Xiaoxi
-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Tuesday, April 21, 2015 12:48 AM
To: Chen, Xiaoxi
Cc: Mark Nelson; Somnath Roy; Duan, Jiangang; Zhang, Jian; ceph-devel
Subject: Re: Reply: Re: NewStore performance analysis
On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
> > > An easy way to measure might be comment out
> > > db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if
> > > we can get more QD in fragment part without issuing the DB.
> >
> > I'm not sure I totally understand the interface.. my assumption is
> > that queue_transaction will give rocksdb the txn to commit whenever
> > it finds it convenient (no idea what policy is used there) and
> > queue_transaction_sync will trigger a commit now. If we did have
> > multiple threads doing queue_trandsaction_sync (by, say, calling it
> > directly in _txc_submit_kv) would qa go up?
> >
> I think you might miss something, currently the two interface are
> exactly the SAME unless you set rocksdb-disable-sync=true(which is
> false by default).
>
> When commit, rocksdb will write the content to both memtable(write
> buffer) and WAL. if the transaction doesnt go with sync, it will also
> commit now,but the write to WAL will NOT be sync(by calling fdatasync).
> That means we may lose data if power failure/kernel panic. This is why
> i changed the default rocksdb-disable-sync from true to false in
> previous patch.
Yeah, I'm confused. :)
So now 'rocksdb disable sync = false', which seems to be obviously what we want for newstore. It's different for filestore, which is doing a syncfs checkpoint. Perhaps we should have newstore set that explicitly instead of passing through a config option.
In any case, though, I'm confused by
> if the transaction doesnt go with sync, it will also commit now,but
> the write to WAL will NOT be sync(by calling fdatasync).
What does it mean to 'commit' but not call fdatasync? What does commit mean in this case?
And, am I correct in understanding that we have
queue_transaction -- queue a transaction but don't block waiting for fdatasync
queue_transaction_sync -- queue transaction and wait until it is stable on disk
to work with?
Thanks!
sage
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Reply: Re: NewStore performance analysis
2015-04-21 6:43 ` Chen, Xiaoxi
@ 2015-04-21 8:51 ` Haomai Wang
[not found] ` <alpine.DEB.2.00.1504211246450.18547@cobra.ne <alpine.DEB.2.00.1504211627110.18547@cobra.newdream.net>
0 siblings, 1 reply; 9+ messages in thread
From: Haomai Wang @ 2015-04-21 8:51 UTC (permalink / raw)
To: Chen, Xiaoxi
Cc: Sage Weil, Mark Nelson, Somnath Roy, Duan, Jiangang, Zhang, Jian,
ceph-devel
On Tue, Apr 21, 2015 at 2:43 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> Hi Sage,
> Well, that's
> submit_transaction -- submit a transaction , whether block waiting for fdatasync depends on rocksdb-disable-sync.
> submit_transaction_sync -- queue transaction and wait until it is stable on disk.
> So if we default rocksdb-disable-sync to false, the two API are same. I haven't look at the LevelDB but I suspect it's similar.
Eh, I don't think they are the same. By default WriteOptions.disableWAL is
false on the ceph side; submit_transaction uses
WriteOptions.sync=false and submit_transaction_sync uses
WriteOptions.sync=true.
If sync == false, rocksdb won't sync the log file; otherwise it calls
fsync/fdatasync to flush the log file.
Please correct me if I'm wrong. :-)
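A toy illustration of the difference, not RocksDB code: ToyWal is a hypothetical append-only log where the sync flag decides whether the write is followed by an fsync, which is the behavior described above for WriteOptions.sync.

```python
import os
import tempfile


class ToyWal:
    """Append-only log file; `sync` controls whether a write is made durable
    before returning."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.sync_calls = 0

    def write(self, record, sync):
        os.write(self._fd, record)   # lands in the page cache
        if sync:
            os.fsync(self._fd)       # forces the file's data to stable storage
            self.sync_calls += 1

    def close(self):
        os.close(self._fd)


with tempfile.TemporaryDirectory() as tmpdir:
    wal_path = os.path.join(tmpdir, "toy.wal")
    wal = ToyWal(wal_path)
    wal.write(b"txn1\n", sync=False)  # like submit_transaction with sync=false
    wal.write(b"txn2\n", sync=True)   # like submit_transaction_sync
    sync_calls = wal.sync_calls
    wal.close()
    wal_size = os.path.getsize(wal_path)

print(f"{wal_size} bytes written, {sync_calls} fsync call(s)")
```

In this toy both records share one file, so the single fsync also covers the earlier unsynced record. Whether that carries over to RocksDB depends on both writes landing in the same log file, which is the concern Xiaoxi raised.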
>
> I just re-read the Newstore code, seems the workflow is not as that we want. We issue a bunch of submit_transaction and in the _kv_sync_thread we try to have a checkpoint that ensure previous transaction are persistent, by using submit_transcation_sync to submit an empty transaction. But actually
> 1. the submit_transaction is already a synchronized call so the empty transcation in _kv_sync_thread is kind of waste.
> 2. An sync transaction cannot ensure the previous transaction is also synced. The API doesn't guarantee this, and from implementation, this two transactions may goes to different WAL files.
>
> Yes, if we want, we can have a Queue and Thread that collecting the transactions and merge them to a big transaction , some ::fdatasync will be saved here. But this approach looks complex.
>
> Some optimizations in my mind are:
> 1. Batch the cleanup operations in _apply_wal_transaction, we don’t need to synchronized remove the WAL item, we can just put them into kv_sync_thread_Q and let kv_sync_thread to form a batch transaction that deleted a bunch of key.
> 2. We don't need the empty transaction in kv_sync_thread, we could call the _txc_kv_finish_kv directly from _txc_submit_kv, since the KV is synchronized.
> 3. Then we can rename _kv_sync_thread to _kv_cleanup_thread to better descript its work.
>
> How do you think
>
> Xiaoxi
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Tuesday, April 21, 2015 12:48 AM
> To: Chen, Xiaoxi
> Cc: Mark Nelson; Somnath Roy; Duan, Jiangang; Zhang, Jian; ceph-devel
> Subject: Re: Reply: Re: NewStore performance analysis
>
> On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
>> > > An easy way to measure might be comment out
>> > > db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if
>> > > we can get more QD in fragment part without issuing the DB.
>> >
>> > I'm not sure I totally understand the interface.. my assumption is
>> > that queue_transaction will give rocksdb the txn to commit whenever
>> > it finds it convenient (no idea what policy is used there) and
>> > queue_transaction_sync will trigger a commit now. If we did have
>> > multiple threads doing queue_transaction_sync (by, say, calling it
>> > directly in _txc_submit_kv) would QD go up?
>> >
>> I think you might be missing something: currently the two interfaces are
>> exactly the SAME unless you set rocksdb-disable-sync=true (which is
>> false by default).
>>
>> When committing, rocksdb writes the content to both the memtable (write
>> buffer) and the WAL. If the transaction doesn't go with sync, it will also
>> commit now, but the write to the WAL will NOT be synced (by calling fdatasync).
>> That means we may lose data on power failure or kernel panic. This is why
>> I changed the default rocksdb-disable-sync from true to false in a
>> previous patch.
>
> Yeah, I'm confused. :)
>
> So now 'rocksdb disable sync = false', which seems to be obviously what we want for newstore. It's different for filestore, which is doing a syncfs checkpoint. Perhaps we should have newstore set that explicitly instead of passing through a config option.
>
> In any case, though, I'm confused by
>
>> If the transaction doesn't go with sync, it will also commit now, but
>> the write to the WAL will NOT be synced (by calling fdatasync).
>
> What does it mean to 'commit' but not call fdatasync? What does commit mean in this case?
>
> And, am I correct in understanding that we have
>
>   queue_transaction -- queue a transaction but don't block waiting for fdatasync
>   queue_transaction_sync -- queue a transaction and wait until it is stable on disk
>
> to work with?
>
> Thanks!
> sage
--
Best Regards,
Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Reply: Re: Reply: Re: Reply: Re: NewStore performance analysis
[not found] ` <alpine.DEB.2.00.1504211627110.18547@cobra.newdream.net>
@ 2015-04-21 23:47 ` Chen, Xiaoxi
[not found] ` <alpine.DEB.2.00.1504211654560.18547@cobra.newdream.net>
0 siblings, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-21 23:47 UTC (permalink / raw)
To: Sage Weil
Cc: Haomai Wang, Mark Nelson, Somnath Roy, Duan, Jiangang,
Zhang, Jian, ceph-devel
---- Sage Weil wrote ----
> On Tue, 21 Apr 2015, Chen, Xiaoxi wrote:
> > Haomai is right in theory, but I am not sure whether all
> > users (mon, filestore, kvstore) of the submit_transaction API clearly hold
> > the expectation that their data is not persistent and may be lost on
> > failure. So in rocksdb the sync now defaults to true even in
> > submit_transaction (and this option makes the two APIs exactly the same).
> > Maybe we need to rename the APIs to
> > submit_transaction_persistent/nonpersistent to better describe the
> > behavior?
>
> Let's audit them, then.. I think they are right, but we may as well
> confirm!
>
> Again, FileStore is the odd one out here because it is relying on the
> syncfs(2) at commit time for everything.
>
Yes, so maybe we don't need to expose the option to users; we can decide whether to sync in the code logic.
I remember some folks on our team tried moving the KVDB to a partition on SSD while leaving the other filestore data on HDD; as I recall, it benefited performance. This deployment is problematic with kv_sync=false. I will check the data first, and then we can evaluate whether we want to support this kind of deployment.
> > And yeah, whether a successful sync (persistent) transaction also persists
> > the previous non-sync (non-persistent) transactions is the main issue here.
>
> Yeah. I posted to facebook, no reply yet!
>
> s
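The open question here can be modelled with a toy log that rolls over. This Python sketch encodes the worst case the thread worries about, namely that a sync only covers the current log file. That is an assumption made for the sake of the example, not a statement about what RocksDB actually does, and RollingWal is a made-up name.

```python
class RollingWal:
    """Toy WAL split across files; sync is assumed to cover only the current file."""

    def __init__(self):
        self.files = [{"records": [], "synced": 0}]

    def roll(self):
        # Start a new log file; the old one keeps whatever was unsynced.
        self.files.append({"records": [], "synced": 0})

    def write(self, record, sync=False):
        current = self.files[-1]
        current["records"].append(record)
        if sync:
            # Worst-case assumption: older files are NOT synced here.
            current["synced"] = len(current["records"])

    def survivors(self):
        """Records that would still be there after a crash."""
        out = []
        for f in self.files:
            out.extend(f["records"][: f["synced"]])
        return out
```

Without a roll, an empty sync write after "t1" keeps both records. With a roll between them, only the empty marker survives under this assumption, which is exactly why the empty-transaction checkpoint is only safe if the store syncs older log files as well.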
>
> >
> > ---- Sage Weil wrote ----
> >
> >
> > On Tue, 21 Apr 2015, Haomai Wang wrote:
> > > On Tue, Apr 21, 2015 at 2:43 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> > > > Hi Sage,
> > > > Well, that's
> > > > > submit_transaction -- submit a transaction; whether it blocks waiting for fdatasync depends on rocksdb-disable-sync.
> > > > > submit_transaction_sync -- queue a transaction and wait until it is stable on disk.
> > > > > So if we default rocksdb-disable-sync to false, the two APIs are the same. I haven't looked at LevelDB, but I suspect it's similar.
> > >
> > > Eh, I don't think it's the same. By default WriteOption.disableWAL is
> > > false in our ceph side, and submit_transaction will use
> > > WriteOption.sync=false and submit_transaction_sync will use
> > > WriteOption.sync=true.
> > >
> > > > If sync==false, rocksdb won't sync the log file; otherwise it will call
> > > > fsync/fdatasync to flush the log file.
> > > >
> > > > Please correct me if I'm wrong. :-)
> >
> > That's what it looks like to me, too. I think the disable sync is a
> > separate optimization for bulk data loading that only filestore wants
> > (because it calls sync(2); we should probably set the option explicitly in
> > FileStore.cc instead of exposing as a ceph option?).
> >
> > I think the issue with what we have now is that a
> > submit_transaction_sync() with an empty transaction may not sync previous
> > transactions if the log rolled over? I would really expect that it would,
> > though... :/ I'll ask on the rocksdb facebook page.
> >
> > sage
> >
> >
> > >
> > > >
> > > > > I just re-read the Newstore code, and the workflow does not seem to be what we want. We issue a bunch of submit_transaction calls, and in _kv_sync_thread we try to create a checkpoint that ensures the previous transactions are persistent, by using submit_transaction_sync to submit an empty transaction. But actually:
> > > > > 1. submit_transaction is already a synchronous call, so the empty transaction in _kv_sync_thread is wasted work.
> > > > > 2. A sync transaction cannot ensure that the previous transactions are also synced. The API doesn't guarantee this, and in the implementation the two transactions may go to different WAL files.
> > > > >
> > > > > Yes, if we want, we can have a queue and a thread that collects the transactions and merges them into one big transaction; some ::fdatasync calls will be saved that way. But this approach looks complex.
> > > > >
> > > > > Some optimizations I have in mind:
> > > > > 1. Batch the cleanup operations in _apply_wal_transaction. We don't need to remove the WAL item synchronously; we can just put the items into kv_sync_thread_Q and let kv_sync_thread form a batch transaction that deletes a bunch of keys.
> > > > > 2. We don't need the empty transaction in kv_sync_thread; we could call _txc_kv_finish_kv directly from _txc_submit_kv, since the KV write is synchronous.
> > > > > 3. Then we can rename _kv_sync_thread to _kv_cleanup_thread to better describe its work.
> > > > >
> > > > > What do you think?
> > > >
> > > > Xiaoxi
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Tuesday, April 21, 2015 12:48 AM
> > > > To: Chen, Xiaoxi
> > > > Cc: Mark Nelson; Somnath Roy; Duan, Jiangang; Zhang, Jian; ceph-devel
> > > > > Subject: Re: Reply: Re: NewStore performance analysis
> > > >
> > > > On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
> > > > >> > > An easy way to measure might be to comment out
> > > > >> > > db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to
> > > > >> > > see if we can get more QD in fragment part without issuing the DB.
> > > >> >
> > > >> > I'm not sure I totally understand the interface.. my assumption is
> > > >> > that queue_transaction will give rocksdb the txn to commit whenever
> > > >> > it finds it convenient (no idea what policy is used there) and
> > > >> > queue_transaction_sync will trigger a commit now. If we did have
> > > > >> > multiple threads doing queue_transaction_sync (by, say, calling it
> > > > >> > directly in _txc_submit_kv) would QD go up?
> > > >> >
> > > > >> I think you might be missing something: currently the two interfaces are
> > > > >> exactly the SAME unless you set rocksdb-disable-sync=true (which is
> > > > >> false by default).
> > > > >>
> > > > >> When committing, rocksdb writes the content to both the memtable (write
> > > > >> buffer) and the WAL. If the transaction doesn't go with sync, it will also
> > > > >> commit now, but the write to the WAL will NOT be synced (by calling fdatasync).
> > > > >> That means we may lose data on power failure or kernel panic. This is why
> > > > >> I changed the default rocksdb-disable-sync from true to false in a
> > > > >> previous patch.
> > > >
> > > > Yeah, I'm confused. :)
> > > >
> > > > So now 'rocksdb disable sync = false', which seems to be obviously what we want for newstore. It's different for filestore, which is doing a syncfs checkpoint. Perhaps we should have newstore set that explicitly instead of passing through a config option.
> > > >
> > > > In any case, though, I'm confused by
> > > >
> > > > >> If the transaction doesn't go with sync, it will also commit now, but
> > > > >> the write to the WAL will NOT be synced (by calling fdatasync).
> > > >
> > > > What does it mean to 'commit' but not call fdatasync? What does commit mean in this case?
> > > >
> > > > > And, am I correct in understanding that we have
> > > > >
> > > > >   queue_transaction -- queue a transaction but don't block waiting for fdatasync
> > > > >   queue_transaction_sync -- queue a transaction and wait until it is stable on disk
> > > >
> > > > to work with?
> > > >
> > > > Thanks!
> > > > sage
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > >
> > > Wheat
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> > >
> >
* Re: Reply: Re: Reply: Re: Reply: Re: NewStore performance analysis
[not found] ` <alpine.DEB.2.00.1504211654560.18547@cobra.newdream.net>
@ 2015-04-21 23:59 ` Mark Nelson
2015-04-22 3:34 ` Chen, Xiaoxi
0 siblings, 1 reply; 9+ messages in thread
From: Mark Nelson @ 2015-04-21 23:59 UTC (permalink / raw)
To: Sage Weil, Chen, Xiaoxi
Cc: Haomai Wang, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel
On 04/21/2015 06:57 PM, Sage Weil wrote:
> On Tue, 21 Apr 2015, Chen, Xiaoxi wrote:
>> ---- Sage Weil wrote ----
>>
>>> On Tue, 21 Apr 2015, Chen, Xiaoxi wrote:
>>>> Haomai is right in theory, but I am not sure whether all
>>>> users (mon, filestore, kvstore) of the submit_transaction API clearly hold
>>>> the expectation that their data is not persistent and may be lost on
>>>> failure. So in rocksdb the sync now defaults to true even in
>>>> submit_transaction (and this option makes the two APIs exactly the same).
>>>> Maybe we need to rename the APIs to
>>>> submit_transaction_persistent/nonpersistent to better describe the
>>>> behavior?
>>>
>>> Let's audit them, then.. I think they are right, but we may as well
>>> confirm!
>>>
>>> Again, FileStore is the odd one out here because it is relying on the
>>> syncfs(2) at commit time for everything.
>>>
>>
>> Yes, so maybe we don't need to expose the option to users; we can decide
>> whether to sync in the code logic.
>
> Yeah, I think it'll reduce confusion too. I suggest we do a pull request
> against master that does this... let me know if you want to do it,
> otherwise I will!
>
>> I remember some folks on our team tried moving the KVDB to a partition on
>> SSD while leaving the other filestore data on HDD; as I recall, it benefited
>> performance. This deployment is problematic with kv_sync=false. I will
>> check the data first, and then we can evaluate whether we want to support
>> this kind of deployment.
>
> We could detect this by doing a stat(2) on the current/omap/ vs current/
> dirs and checking if it's a different file system. If so, we can do the
> syncfs(2) on both dirs. The btrfs case would probably not be practical,
> but we can error out in that case. But yeah not sure how important it
> would be to support this since filestore doesn't use leveldb that
> heavily... and I'd prefer to limit our investment of time there if we can
> instead make newstore (or something else) better.
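A minimal sketch of the detection described above, in Python for brevity: compare the st_dev field returned by stat(2) for the two directories. The function names are made up for illustration, the paths are whatever the OSD uses for current/ and current/omap/, and the sync itself is left out.

```python
import os


def on_different_filesystems(path_a, path_b):
    """True if the two paths report different device IDs via stat(2)."""
    return os.stat(path_a).st_dev != os.stat(path_b).st_dev


def dirs_to_sync(current_dir, omap_dir):
    """Return the directories whose filesystems would each need a syncfs(2)."""
    if on_different_filesystems(current_dir, omap_dir):
        return [current_dir, omap_dir]
    return [current_dir]
```

When omap is just a subdirectory on the same filesystem, dirs_to_sync returns only current_dir, so the existing single syncfs(2) stays sufficient.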
FWIW, the last time I tried it, putting leveldb on SSD didn't really help at
all. It's been a while so maybe that's changed, but newstore definitely
seems like the way forward to me.
Mark
* RE: Reply: Re: Reply: Re: Reply: Re: NewStore performance analysis
2015-04-21 23:59 ` Mark Nelson
@ 2015-04-22 3:34 ` Chen, Xiaoxi
0 siblings, 0 replies; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-04-22 3:34 UTC (permalink / raw)
To: Mark Nelson, Sage Weil
Cc: Haomai Wang, Somnath Roy, Duan, Jiangang, Zhang, Jian, ceph-devel,
Xue, Chendi
Hi Sage and Mark,
Chendi from our team has done the test based on v0.91. The setup is 4 nodes, 40 HDDs in total, with SSDs as journals, replica=2.
Mounting a partition from the journal SSD at /current/omap improves 4K random write IOPS (peak) from 1524 to 2694, about 77%, while other IO patterns stay the same.
Some details are below.
If this reproduces on other setups, I think it is worth investing some time in the detection.
                 Prev       Omap2ssd
Runid            305        320
OP_SIZE          4k         4k
OP_TYPE          randwrite  randwrite
QD               qd8        qd8
Engine           vdb        vdb
server_num       4          4
client_num       2          2
rbd_num          40         40
RBD_FIO_IOPS     1524       2694
RBD_FIO_BW       6170.1     10864.23
RBD_FIO_Latency  209.3851   119.4587
osd_read_iops    7.862196   322.4334
osd_write_iops   7677.648   10930
osd_read_bw      0.446566   1.409266
(no header)      54.916435  71.33833
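For anyone who wants to re-derive the headline numbers, a small Python check of the arithmetic follows. The IOPS and latency values are copied from the two runs above. The last part is a rough Little's law sanity check that assumes the latency column is in milliseconds and that each of the 40 RBD volumes runs at queue depth 8; neither assumption is stated in the data.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


iops_change = pct_change(1524, 2694)
latency_change = pct_change(209.3851, 119.4587)
print(f"IOPS change:    {iops_change:+.1f}%")
print(f"Latency change: {latency_change:+.1f}%")

# Little's law: outstanding requests = throughput * latency.
# Assumption: latency is in ms, so divide by 1000 to get seconds.
expected_outstanding = 40 * 8  # rbd_num * QD, assumed per-volume queue depth
outstanding_prev = 1524 * 209.3851 / 1000
outstanding_omap = 2694 * 119.4587 / 1000
print(f"Outstanding (Prev):     {outstanding_prev:.1f}")
print(f"Outstanding (Omap2ssd): {outstanding_omap:.1f}")
```

Under those assumptions both runs come out close to 320 outstanding requests, which suggests the IOPS and latency columns are at least consistent with each other.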
Xiaoxi
end of thread, other threads:[~2015-04-22 3:34 UTC | newest]
Thread overview: 9+ messages
2015-04-20 14:59 NewStore performance analysis Chen, Xiaoxi
2015-04-20 15:39 ` Sage Weil
2015-04-20 15:55 ` Mark Nelson
2015-04-20 16:11 ` Reply: " Chen, Xiaoxi
[not found] ` <alpine.DEB.2.00.1504200945000.18547@cobra.newdream.net>
2015-04-21 6:43 ` Chen, Xiaoxi
2015-04-21 8:51 ` Haomai Wang
[not found] ` <alpine.DEB.2.00.1504211246450.18547@cobra.ne <alpine.DEB.2.00.1504211627110.18547@cobra.newdream.net>
[not found] ` <alpine.DEB.2.00.1504211627110.18547@cobra.newdream.net>
2015-04-21 23:47 ` Reply: Re: Reply: " Chen, Xiaoxi
[not found] ` <alpine.DEB.2.00.1504211654560.18547@cobra.newdream.net>
2015-04-21 23:59 ` Mark Nelson
2015-04-22 3:34 ` Chen, Xiaoxi