bluestore onode diet and encoding overhead

* bluestore onode diet and encoding overhead
@ 2016-07-12  7:03 Mark Nelson
  2016-07-12  7:13 ` Somnath Roy
  2016-07-12 15:20 ` Allen Samuels
  0 siblings, 2 replies; 39+ messages in thread
From: Mark Nelson @ 2016-07-12  7:03 UTC (permalink / raw)
  To: ceph-devel

Hi All,

With Igor's patch last week I was able to get some bluestore performance
runs in without segfaulting and started looking into the results.
Somewhere along the line we really screwed up read performance, but
that's another topic.  Right now I want to focus on random writes.
Before we put the onode on a diet we were seeing massive amounts of read 
traffic in RocksDB during compaction that caused write stalls during 4K 
random writes.  Random write performance on fast hardware like NVMe 
devices was often below filestore at anything other than very large IO 
sizes.  This was largely due to the size of the onode compounded with 
RocksDB's tendency toward read and write amplification.

The new test results look very promising.  We've dramatically improved 
performance of random writes at most IO sizes, so that they are now 
typically quite a bit higher than both filestore and older bluestore 
code.  Unfortunately, for very small IO sizes performance hasn't improved
much.  We are no longer seeing huge amounts of RocksDB read traffic, and
we are seeing fewer write stalls.  We are, however, seeing huge memory
usage (~9GB RSS per OSD) and very high CPU usage.  I think this confirms
some of the memory issues Somnath was continuing to see.  I don't think
it's a leak exactly, based on how the OSDs were behaving, but we still
need to run through massif to be sure.

I ended up spending some time tonight with perf and digging through the 
encode code.  I wrote up some notes with graphs and code snippets and 
decided to put them up on the web.  Basically some of the encoding 
changes we implemented last month to reduce the onode size also appear 
to result in more buffer::list appends and the associated overhead. 
I've been trying to think through ways to improve the situation and 
thought other people might have some ideas too.  Here's a link to the 
short writeup:

https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
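
To make the append overhead concrete, here is a minimal sketch.  It is
not taken from the writeup or from the Ceph tree.  CountingBuffer is a
hypothetical stand-in for buffer::list that only counts append calls,
and the 7-bits-per-byte variable-length encoding is an assumption used
for illustration, not necessarily the encoding the onode changes use.
It shows how a compact encoding can turn one fixed-size append into
several tiny ones, and how staging the bytes locally first gets back to
a single append:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for an append-only buffer; it records how many
// times append() is called so the two encoders can be compared.
struct CountingBuffer {
  std::string data;
  std::size_t appends = 0;

  void append(const void* p, std::size_t n) {
    data.append(static_cast<const char*>(p), n);
    ++appends;
  }
};

// Variable-length encode: 7 payload bits per byte, high bit set when
// more bytes follow.  Issues one append per output byte.
void encode_varint_per_byte(uint64_t v, CountingBuffer& bl) {
  do {
    unsigned char b = static_cast<unsigned char>(v & 0x7f);
    v >>= 7;
    if (v) b |= 0x80;
    bl.append(&b, 1);
  } while (v);
}

// Same byte layout, but staged in a small stack buffer so the
// destination sees a single append per value.
void encode_varint_staged(uint64_t v, CountingBuffer& bl) {
  unsigned char tmp[10];  // a 64-bit value needs at most 10 bytes
  std::size_t n = 0;
  do {
    unsigned char b = static_cast<unsigned char>(v & 0x7f);
    v >>= 7;
    if (v) b |= 0x80;
    assert(n < sizeof tmp);
    tmp[n++] = b;
  } while (v);
  bl.append(tmp, n);
}
```

Encoding the values 1, 300 and 1048576 produces the same six bytes
either way, but with six appends in the first variant and three in the
second.  Whether the real cost sits in the append path itself or
elsewhere is something the perf data has to answer.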

Thanks,
Mark

* RE: bluestore onode diet and encoding overhead
  2016-07-12  7:03 bluestore onode diet and encoding overhead Mark Nelson
@ 2016-07-12  7:13 ` Somnath Roy
  2016-07-12 12:34   ` Mark Nelson
  2016-07-12 15:20 ` Allen Samuels
  1 sibling, 1 reply; 39+ messages in thread
From: Somnath Roy @ 2016-07-12  7:13 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel

Thanks, Mark!
Yes, I am seeing quite similar results for 4K RW. BTW, did you get a chance to try out the RocksDB tuning I posted earlier? It may reduce the stalls in your environment.

Regards
Somnath

* Re: bluestore onode diet and encoding overhead
  2016-07-12  7:13 ` Somnath Roy
@ 2016-07-12 12:34   ` Mark Nelson
  2016-07-12 12:40     ` Igor Fedotov
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Nelson @ 2016-07-12 12:34 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Hi Somnath,

I accidentally screwed up my first run with your settings but reran last 
night.  With your tuning the OSDs are failing to allocate to bdev0 after 
about 30 minutes of testing:

2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed to allocate 1048576 on bdev 0, free 0; fallback to bdev 1

They are able to continue running, but ultimately this leads to an 
assert later on.  I wonder if it's not compacting fast enough and ends 
up consuming the entire disk with stale metadata.

2016-07-12 04:31:02.631982 7f0cef8b7700 -1 /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(unsigned int, uint64_t, std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12 04:31:02.627138
/home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398: FAILED assert(0 == "allocate failed... wtf")

  ceph version v10.0.4-6936-gc7da2f7 (c7da2f7c869694246650a9276a2b67aed9bf818f)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xd4cb75]
  2: (BlueFS::_allocate(unsigned int, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x760) [0xb98220]
  3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
  4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
  5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
  6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x1456) [0xbfdb96]
  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
  9: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b) [0xb3df2b]
  10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
  11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
  12: (()+0x7dc5) [0x7f0d185c4dc5]
  13: (clone()+0x6d) [0x7f0d164bf28d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


* Re: bluestore onode diet and encoding overhead
  2016-07-12 12:34   ` Mark Nelson
@ 2016-07-12 12:40     ` Igor Fedotov
  2016-07-12 12:47       ` Varada Kari
  2016-07-12 12:48       ` Mark Nelson
  0 siblings, 2 replies; 39+ messages in thread
From: Igor Fedotov @ 2016-07-12 12:40 UTC (permalink / raw)
  To: Mark Nelson, Somnath Roy, ceph-devel

That's similar to what I see when running my test case with vstart,
though without Somnath's settings.


* Re: bluestore onode diet and encoding overhead
  2016-07-12 12:40     ` Igor Fedotov
@ 2016-07-12 12:47       ` Varada Kari
  2016-07-12 12:48       ` Mark Nelson
  1 sibling, 0 replies; 39+ messages in thread
From: Varada Kari @ 2016-07-12 12:47 UTC (permalink / raw)
  To: Igor Fedotov, Mark Nelson, Somnath Roy, ceph-devel

We reserve the space before the actual allocation, and that seems to go
fine here, but when it comes to allocating the blocks we are running out
of space. I am trying to reproduce the same problem and will update if I
have any findings.

Varada

* Re: bluestore onode diet and encoding overhead
  2016-07-12 12:40     ` Igor Fedotov
  2016-07-12 12:47       ` Varada Kari
@ 2016-07-12 12:48       ` Mark Nelson
  2016-07-12 12:57         ` Igor Fedotov
  1 sibling, 1 reply; 39+ messages in thread
From: Mark Nelson @ 2016-07-12 12:48 UTC (permalink / raw)
  To: Igor Fedotov, Somnath Roy, ceph-devel

Oh, that's good to know!  Have you tracked it down at all?  I noticed 
pretty extreme memory usage on the OSDs still, so that might be part of 
it.  I'm doing a massif run now.

Mark

On 07/12/2016 07:40 AM, Igor Fedotov wrote:
> That's similar to what I have while running my test case with vstart...
> Without Somnath's settings though..
>
>
> On 12.07.2016 15:34, Mark Nelson wrote:
>> Hi Somnath,
>>
>> I accidentally screwed up my first run with your settings but reran
>> last night.  With your tuning the OSDs are failing to allocate to
>> bdev0 after about 30 minutes of testing:
>>
>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed to
>> allocate 1048576 on bdev 0, free 0; fallback to bdev 1
>>
>> They are able to continue running, but ultimately this leads to an
>> assert later on.  I wonder if it's not compacting fast enough and ends
>> up consuming the entire disk with stale metadata.
>>
>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In function
>> 'int BlueFS::_allocate(unsigned int, uint64_t,
>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12
>> 04:31:02.627138
>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398: FAILED
>> assert(0 == "allocate failed... wtf")
>>
>>  ceph version v10.0.4-6936-gc7da2f7
>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x85) [0xd4cb75]
>>  2: (BlueFS::_allocate(unsigned int, unsigned long,
>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t>
>> >*)+0x760) [0xb98220]
>>  3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
>>  4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
>>  5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
>>  6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
>>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
>> unsigned long, bool)+0x1456) [0xbfdb96]
>>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
>>  9:
>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b)
>> [0xb3df2b]
>>  10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
>>  11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
>>  12: (()+0x7dc5) [0x7f0d185c4dc5]
>>  13: (clone()+0x6d) [0x7f0d164bf28d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>>
>>
>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
>>> Thanks Mark !
>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did you
>>> get chance to try out the rocksdb tuning I posted earlier ? It may
>>> reduce the stalls in your environment.
>>>
>>> Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Tuesday, July 12, 2016 12:03 AM
>>> To: ceph-devel
>>> Subject: bluestore onode diet and encoding overhead
>>>
>>> Hi All,
>>>
>>> With Igor's patch last week I was able to get some bluestore
>>> performance runs in without segfaulting and started looking int the
>>> results.
>>> Somewhere along the line we really screwed up read performance, but
>>> that's another topic.  Right now I want to focus on random writes.
>>> Before we put the onode on a diet we were seeing massive amounts of
>>> read traffic in RocksDB during compaction that caused write stalls
>>> during 4K random writes.  Random write performance on fast hardware
>>> like NVMe devices was often below filestore at anything other than
>>> very large IO sizes.  This was largely due to the size of the onode
>>> compounded with RocksDB's tendency toward read and write amplification.
>>>
>>> The new test results look very promising.  We've dramatically
>>> improved performance of random writes at most IO sizes, so that they
>>> are now typically quite a bit higher than both filestore and older
>>> bluestore code.  Unfortunately for very small IO sizes performance
>>> hasn't improved much.  We are no longer seeing huge amounts of
>>> RocksDB read traffic and fewer write stalls.  We are however seeing
>>> huge memory usage (~9GB RSS per OSD) and very high CPU usage.  I
>>> think this confirms some of the memory issues somnath was continuing
>>> to see.  I don't think it's a leak exactly based on how the OSDs were
>>> behaving, but we need to run through massif still to be sure.
>>>
>>> I ended up spending some time tonight with perf and digging through
>>> the encode code.  I wrote up some notes with graphs and code snippets
>>> and decided to put them up on the web.  Basically some of the
>>> encoding changes we implemented last month to reduce the onode size
>>> also appear to result in more buffer::list appends and the associated
>>> overhead.
>>> I've been trying to think through ways to improve the situation and
>>> thought other people might have some ideas too.  Here's a link to the
>>> short writeup:
>>>
>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
>>>
>>>
>>> Thanks,
>>> Mark
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>> PLEASE NOTE: The information contained in this electronic mail
>>> message is intended only for the use of the designated recipient(s)
>>> named above. If the reader of this message is not the intended
>>> recipient, you are hereby notified that you have received this
>>> message in error and that any review, dissemination, distribution, or
>>> copying of this message is strictly prohibited. If you have received
>>> this communication in error, please notify the sender by telephone or
>>> e-mail (as shown above) immediately and destroy any and all copies of
>>> this message in your possession (whether hard copies or
>>> electronically stored copies).
>>>
>>>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-12 12:48       ` Mark Nelson
@ 2016-07-12 12:57         ` Igor Fedotov
  2016-07-12 13:02           ` Mark Nelson
  0 siblings, 1 reply; 39+ messages in thread
From: Igor Fedotov @ 2016-07-12 12:57 UTC (permalink / raw)
  To: Mark Nelson, Somnath Roy, ceph-devel

Mark,

You can find my post from last week titled 'yet another assertion in
bluestore during random write'. It contains the steps to reproduce in my case.

I also did some investigation (still incomplete, though) by tuning
'bluestore block db size' and 'bluestore block wal size'. Setting both
to 256M fixes the issue for me.

But I'm still uncertain whether that's a bug or just inappropriate settings...


Thanks,

Igor


On 12.07.2016 15:48, Mark Nelson wrote:
> Oh, that's good to know!  Have you tracked it down at all?  I noticed 
> pretty extreme memory usage on the OSDs still, so that might be part 
> of it.  I'm doing a massif run now.
>
> Mark
>
> On 07/12/2016 07:40 AM, Igor Fedotov wrote:
>> That's similar to what I have while running my test case with vstart...
>> Without Somnath's settings though..
>>
>>
>> On 12.07.2016 15:34, Mark Nelson wrote:
>>> Hi Somnath,
>>>
>>> I accidentally screwed up my first run with your settings but reran
>>> last night.  With your tuning the OSDs are failing to allocate to
>>> bdev0 after about 30 minutes of testing:
>>>
>>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed to
>>> allocate 1048576 on bdev 0, free 0; fallback to bdev 1
>>>
>>> They are able to continue running, but ultimately this leads to an
>>> assert later on.  I wonder if it's not compacting fast enough and ends
>>> up consuming the entire disk with stale metadata.
>>>
>>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In function
>>> 'int BlueFS::_allocate(unsigned int, uint64_t,
>>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12
>>> 04:31:02.627138
>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398: FAILED
>>> assert(0 == "allocate failed... wtf")
>>>
>>>  ceph version v10.0.4-6936-gc7da2f7
>>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x85) [0xd4cb75]
>>>  2: (BlueFS::_allocate(unsigned int, unsigned long,
>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t>
>>> >*)+0x760) [0xb98220]
>>>  3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
>>>  4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
>>>  5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
>>>  6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
>>>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
>>> unsigned long, bool)+0x1456) [0xbfdb96]
>>>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
>>>  9:
>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b) 
>>>
>>> [0xb3df2b]
>>>  10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
>>>  11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
>>>  12: (()+0x7dc5) [0x7f0d185c4dc5]
>>>  13: (clone()+0x6d) [0x7f0d164bf28d]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>> needed to interpret this.
>>>
>>>
>>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
>>>> Thanks Mark !
>>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did you
>>>> get chance to try out the rocksdb tuning I posted earlier ? It may
>>>> reduce the stalls in your environment.
>>>>
>>>> Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>> Sent: Tuesday, July 12, 2016 12:03 AM
>>>> To: ceph-devel
>>>> Subject: bluestore onode diet and encoding overhead
>>>>
>>>> Hi All,
>>>>
>>>> With Igor's patch last week I was able to get some bluestore
>>>> performance runs in without segfaulting and started looking int the
>>>> results.
>>>> Somewhere along the line we really screwed up read performance, but
>>>> that's another topic.  Right now I want to focus on random writes.
>>>> Before we put the onode on a diet we were seeing massive amounts of
>>>> read traffic in RocksDB during compaction that caused write stalls
>>>> during 4K random writes.  Random write performance on fast hardware
>>>> like NVMe devices was often below filestore at anything other than
>>>> very large IO sizes.  This was largely due to the size of the onode
>>>> compounded with RocksDB's tendency toward read and write 
>>>> amplification.
>>>>
>>>> The new test results look very promising.  We've dramatically
>>>> improved performance of random writes at most IO sizes, so that they
>>>> are now typically quite a bit higher than both filestore and older
>>>> bluestore code.  Unfortunately for very small IO sizes performance
>>>> hasn't improved much.  We are no longer seeing huge amounts of
>>>> RocksDB read traffic and fewer write stalls.  We are however seeing
>>>> huge memory usage (~9GB RSS per OSD) and very high CPU usage.  I
>>>> think this confirms some of the memory issues somnath was continuing
>>>> to see.  I don't think it's a leak exactly based on how the OSDs were
>>>> behaving, but we need to run through massif still to be sure.
>>>>
>>>> I ended up spending some time tonight with perf and digging through
>>>> the encode code.  I wrote up some notes with graphs and code snippets
>>>> and decided to put them up on the web.  Basically some of the
>>>> encoding changes we implemented last month to reduce the onode size
>>>> also appear to result in more buffer::list appends and the associated
>>>> overhead.
>>>> I've been trying to think through ways to improve the situation and
>>>> thought other people might have some ideas too.  Here's a link to the
>>>> short writeup:
>>>>
>>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing 
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-12 12:57         ` Igor Fedotov
@ 2016-07-12 13:02           ` Mark Nelson
  2016-07-12 15:14             ` Somnath Roy
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Nelson @ 2016-07-12 13:02 UTC (permalink / raw)
  To: Igor Fedotov, Somnath Roy, ceph-devel

In this case I'm assigning per OSD:

1G Data (basically the top level OSD dir)
1G WAL
8G DB
140G Block

Mark

On 07/12/2016 07:57 AM, Igor Fedotov wrote:
> Mark,
>
> you can find my post named 'yet another assertion in bluestore during
> random write' last week. It contains steps to reproduce in my case.
>
> Also I did some investigations (still incomplete though) with tuning
> 'bluestore block db size' and 'bluestore block wal size'. Setting both
> to 256M fixes the issue for me.
>
> But I'm still uncertain if that's a bug or just inappropriate settings...
>
>
> Thanks,
>
> Igor
>
>
> On 12.07.2016 15:48, Mark Nelson wrote:
>> Oh, that's good to know!  Have you tracked it down at all?  I noticed
>> pretty extreme memory usage on the OSDs still, so that might be part
>> of it.  I'm doing a massif run now.
>>
>> Mark
>>
>> On 07/12/2016 07:40 AM, Igor Fedotov wrote:
>>> That's similar to what I have while running my test case with vstart...
>>> Without Somnath's settings though..
>>>
>>>
>>> On 12.07.2016 15:34, Mark Nelson wrote:
>>>> Hi Somnath,
>>>>
>>>> I accidentally screwed up my first run with your settings but reran
>>>> last night.  With your tuning the OSDs are failing to allocate to
>>>> bdev0 after about 30 minutes of testing:
>>>>
>>>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed to
>>>> allocate 1048576 on bdev 0, free 0; fallback to bdev 1
>>>>
>>>> They are able to continue running, but ultimately this leads to an
>>>> assert later on.  I wonder if it's not compacting fast enough and ends
>>>> up consuming the entire disk with stale metadata.
>>>>
>>>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In function
>>>> 'int BlueFS::_allocate(unsigned int, uint64_t,
>>>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12
>>>> 04:31:02.627138
>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398: FAILED
>>>> assert(0 == "allocate failed... wtf")
>>>>
>>>>  ceph version v10.0.4-6936-gc7da2f7
>>>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>> const*)+0x85) [0xd4cb75]
>>>>  2: (BlueFS::_allocate(unsigned int, unsigned long,
>>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t>
>>>> >*)+0x760) [0xb98220]
>>>>  3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
>>>>  4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
>>>>  5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
>>>>  6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
>>>>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
>>>> unsigned long, bool)+0x1456) [0xbfdb96]
>>>>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>>>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
>>>>  9:
>>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b)
>>>>
>>>> [0xb3df2b]
>>>>  10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
>>>>  11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
>>>>  12: (()+0x7dc5) [0x7f0d185c4dc5]
>>>>  13: (clone()+0x6d) [0x7f0d164bf28d]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>> needed to interpret this.
>>>>
>>>>
>>>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
>>>>> Thanks Mark !
>>>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did you
>>>>> get chance to try out the rocksdb tuning I posted earlier ? It may
>>>>> reduce the stalls in your environment.
>>>>>
>>>>> Regards
>>>>> Somnath
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>> Sent: Tuesday, July 12, 2016 12:03 AM
>>>>> To: ceph-devel
>>>>> Subject: bluestore onode diet and encoding overhead
>>>>>
>>>>> Hi All,
>>>>>
>>>>> With Igor's patch last week I was able to get some bluestore
>>>>> performance runs in without segfaulting and started looking int the
>>>>> results.
>>>>> Somewhere along the line we really screwed up read performance, but
>>>>> that's another topic.  Right now I want to focus on random writes.
>>>>> Before we put the onode on a diet we were seeing massive amounts of
>>>>> read traffic in RocksDB during compaction that caused write stalls
>>>>> during 4K random writes.  Random write performance on fast hardware
>>>>> like NVMe devices was often below filestore at anything other than
>>>>> very large IO sizes.  This was largely due to the size of the onode
>>>>> compounded with RocksDB's tendency toward read and write
>>>>> amplification.
>>>>>
>>>>> The new test results look very promising.  We've dramatically
>>>>> improved performance of random writes at most IO sizes, so that they
>>>>> are now typically quite a bit higher than both filestore and older
>>>>> bluestore code.  Unfortunately for very small IO sizes performance
>>>>> hasn't improved much.  We are no longer seeing huge amounts of
>>>>> RocksDB read traffic and fewer write stalls.  We are however seeing
>>>>> huge memory usage (~9GB RSS per OSD) and very high CPU usage.  I
>>>>> think this confirms some of the memory issues somnath was continuing
>>>>> to see.  I don't think it's a leak exactly based on how the OSDs were
>>>>> behaving, but we need to run through massif still to be sure.
>>>>>
>>>>> I ended up spending some time tonight with perf and digging through
>>>>> the encode code.  I wrote up some notes with graphs and code snippets
>>>>> and decided to put them up on the web.  Basically some of the
>>>>> encoding changes we implemented last month to reduce the onode size
>>>>> also appear to result in more buffer::list appends and the associated
>>>>> overhead.
>>>>> I've been trying to think through ways to improve the situation and
>>>>> thought other people might have some ideas too.  Here's a link to the
>>>>> short writeup:
>>>>>
>>>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>>
>>>>>
>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-12 13:02           ` Mark Nelson
@ 2016-07-12 15:14             ` Somnath Roy
  2016-07-12 15:31               ` Igor Fedotov
                                 ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Somnath Roy @ 2016-07-12 15:14 UTC (permalink / raw)
  To: Mark Nelson, Igor Fedotov, ceph-devel

Mark,
Recently the default allocator was changed to Bitmap, and I see that it returns a negative value only in the following case:

  count = m_bit_alloc->alloc_blocks_res(nblks, &start_blk);
  if (count == 0) {
    return -ENOSPC;
  }

So it seems the problem may not be memory; rather, the db partition is running out of space (?). I have never hit this so far, maybe because I was running with a 100GB db partition.
The amount of metadata written to the db, even after the onode diet, starts at ~1K per write and over time reaches > 4K or so (I checked for 4K RW). It grows as the extents grow. So 8 GB may not be enough.
If this is true, the next challenge is how to automatically size (or at least document how to size) the rocksdb db partition based on the data partition size. For example, in the ZS case we calculated that we need ~9G of db space per TB. We need to do a similar calculation for rocksdb as well.
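
As a rough illustration of the sizing question above, here is a minimal sketch that scales a db partition from the data partition size using a fixed ratio. The ~9G-per-TB ratio is the ZS figure quoted above and is only a placeholder; the right ratio for rocksdb has not been measured in this thread. The helper name db_mib_for and the choice of MiB units are assumptions made for the example, not anything in Ceph.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of Ceph: size the db partition as a fixed
// ratio of the data partition. Sizes are in MiB to keep the arithmetic
// comfortably inside 64 bits.
constexpr std::uint64_t kMiBPerTiB = 1024ull * 1024ull;

inline std::uint64_t db_mib_for(std::uint64_t data_mib,
                                std::uint64_t db_mib_per_tib) {
  // Guard against overflow in the multiplication below.
  assert(db_mib_per_tib == 0 || data_mib <= UINT64_MAX / db_mib_per_tib);
  // Round up so the estimate never undershoots the ratio.
  return (data_mib * db_mib_per_tib + kMiBPerTiB - 1) / kMiBPerTiB;
}
```

With the ZS ratio (9216 MiB per TiB), db_mib_for(140 * 1024, 9 * 1024) gives 1260 MiB for a 140G block device. That is only what the placeholder ratio predicts; whatever ratio rocksdb actually needs would have to be plugged in once it is known.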

Thanks & Regards
Somnath


-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com]
Sent: Tuesday, July 12, 2016 6:03 AM
To: Igor Fedotov; Somnath Roy; ceph-devel
Subject: Re: bluestore onode diet and encoding overhead

In this case I'm assigning per OSD:

1G Data (basically the top level OSD dir)
1G WAL
8G DB
140G Block

Mark

On 07/12/2016 07:57 AM, Igor Fedotov wrote:
> Mark,
>
> you can find my post named 'yet another assertion in bluestore during
> random write' last week. It contains steps to reproduce in my case.
>
> Also I did some investigations (still incomplete though) with tuning
> 'bluestore block db size' and 'bluestore block wal size'. Setting both
> to 256M fixes the issue for me.
>
> But I'm still uncertain if that's a bug or just inappropriate settings...
>
>
> Thanks,
>
> Igor
>
>
> On 12.07.2016 15:48, Mark Nelson wrote:
>> Oh, that's good to know!  Have you tracked it down at all?  I noticed
>> pretty extreme memory usage on the OSDs still, so that might be part
>> of it.  I'm doing a massif run now.
>>
>> Mark
>>
>> On 07/12/2016 07:40 AM, Igor Fedotov wrote:
>>> That's similar to what I have while running my test case with vstart...
>>> Without Somnath's settings though..
>>>
>>>
>>> On 12.07.2016 15:34, Mark Nelson wrote:
>>>> Hi Somnath,
>>>>
>>>> I accidentally screwed up my first run with your settings but reran
>>>> last night.  With your tuning the OSDs are failing to allocate to
>>>> bdev0 after about 30 minutes of testing:
>>>>
>>>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed
>>>> to allocate 1048576 on bdev 0, free 0; fallback to bdev 1
>>>>
>>>> They are able to continue running, but ultimately this leads to an
>>>> assert later on.  I wonder if it's not compacting fast enough and
>>>> ends up consuming the entire disk with stale metadata.
>>>>
>>>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In
>>>> function 'int BlueFS::_allocate(unsigned int, uint64_t,
>>>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12
>>>> 04:31:02.627138
>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398:
>>>> FAILED
>>>> assert(0 == "allocate failed... wtf")
>>>>
>>>>  ceph version v10.0.4-6936-gc7da2f7
>>>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>> const*)+0x85) [0xd4cb75]
>>>>  2: (BlueFS::_allocate(unsigned int, unsigned long,
>>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t>
>>>> >*)+0x760) [0xb98220]
>>>>  3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
>>>>  4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
>>>>  5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
>>>>  6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
>>>>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
>>>> unsigned long, bool)+0x1456) [0xbfdb96]
>>>>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>>>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
>>>>  9:
>>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b)
>>>>
>>>> [0xb3df2b]
>>>>  10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
>>>>  11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
>>>>  12: (()+0x7dc5) [0x7f0d185c4dc5]
>>>>  13: (clone()+0x6d) [0x7f0d164bf28d]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>> needed to interpret this.
>>>>
>>>>
>>>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
>>>>> Thanks Mark !
>>>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did you
>>>>> get chance to try out the rocksdb tuning I posted earlier ? It may
>>>>> reduce the stalls in your environment.
>>>>>
>>>>> Regards
>>>>> Somnath
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>> Sent: Tuesday, July 12, 2016 12:03 AM
>>>>> To: ceph-devel
>>>>> Subject: bluestore onode diet and encoding overhead
>>>>>
>>>>> Hi All,
>>>>>
>>>>> With Igor's patch last week I was able to get some bluestore
>>>>> performance runs in without segfaulting and started looking int
>>>>> the results.
>>>>> Somewhere along the line we really screwed up read performance,
>>>>> but that's another topic.  Right now I want to focus on random writes.
>>>>> Before we put the onode on a diet we were seeing massive amounts
>>>>> of read traffic in RocksDB during compaction that caused write
>>>>> stalls during 4K random writes.  Random write performance on fast
>>>>> hardware like NVMe devices was often below filestore at anything
>>>>> other than very large IO sizes.  This was largely due to the size
>>>>> of the onode compounded with RocksDB's tendency toward read and
>>>>> write amplification.
>>>>>
>>>>> The new test results look very promising.  We've dramatically
>>>>> improved performance of random writes at most IO sizes, so that
>>>>> they are now typically quite a bit higher than both filestore and
>>>>> older bluestore code.  Unfortunately for very small IO sizes
>>>>> performance hasn't improved much.  We are no longer seeing huge
>>>>> amounts of RocksDB read traffic and fewer write stalls.  We are
>>>>> however seeing huge memory usage (~9GB RSS per OSD) and very high
>>>>> CPU usage.  I think this confirms some of the memory issues
>>>>> somnath was continuing to see.  I don't think it's a leak exactly
>>>>> based on how the OSDs were behaving, but we need to run through massif still to be sure.
>>>>>
>>>>> I ended up spending some time tonight with perf and digging
>>>>> through the encode code.  I wrote up some notes with graphs and
>>>>> code snippets and decided to put them up on the web.  Basically
>>>>> some of the encoding changes we implemented last month to reduce
>>>>> the onode size also appear to result in more buffer::list appends
>>>>> and the associated overhead.
>>>>> I've been trying to think through ways to improve the situation
>>>>> and thought other people might have some ideas too.  Here's a link
>>>>> to the short writeup:
>>>>>
>>>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>>
>>>>>
>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-12  7:03 bluestore onode diet and encoding overhead Mark Nelson
  2016-07-12  7:13 ` Somnath Roy
@ 2016-07-12 15:20 ` Allen Samuels
  2016-07-12 15:37   ` Mark Nelson
  2016-07-13  1:50   ` Sage Weil
  1 sibling, 2 replies; 39+ messages in thread
From: Allen Samuels @ 2016-07-12 15:20 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel

Good analysis. 

My original comments about putting the oNode on a diet included the idea of a "custom" encode/decode path for certain high-usage cases. At the time, Sage resisted going down that path hoping that a more optimized generic case would get the job done. Your analysis shows that while we've achieved significant space reduction this has come at the expense of CPU time -- which dominates small object performance (I suspect that eventually we'd discover that the variable length decode path would be responsible for a substantial read performance degradation also -- which may or may not be part of the read performance drop-off that you're seeing). This isn't a surprising result, though it is unfortunate.

I believe we need to revisit the idea of custom encode/decode paths for high-usage cases, only now the gains need to be focused on CPU utilization as well as space efficiency.
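
To make the CPU side of this trade-off concrete, here is a minimal sketch of the kind of custom encode path being discussed: reserve space once, then write variable-length integers straight into contiguous memory instead of issuing one small append per field. It uses a generic LEB128-style varint and std::string as the buffer. Neither is Ceph's actual encoding helper or buffer::list, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append v to out as a LEB128-style varint: 7 data bits per byte, with the
// high bit set on every byte except the last.
inline void put_varint(std::string& out, std::uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// Encode a batch of values with a single up-front reservation. A 64-bit
// value needs at most 10 varint bytes, so the appends never reallocate.
inline std::string encode_all(const std::vector<std::uint64_t>& values) {
  std::string out;
  out.reserve(values.size() * 10);
  for (std::uint64_t v : values) put_varint(out, v);
  return out;
}

// Decode one varint starting at pos and advance pos past it.
inline std::uint64_t get_varint(const std::string& in, std::size_t& pos) {
  std::uint64_t v = 0;
  for (unsigned shift = 0;; shift += 7) {
    assert(pos < in.size() && shift < 64);
    const auto byte = static_cast<unsigned char>(in[pos++]);
    v |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) return v;
  }
}
```

The space saving comes from the varint itself; the CPU saving in this sketch comes from doing the size estimate once per object rather than paying growth and bookkeeping costs on every field.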

I believe this activity can also address some of the memory consumption issues that we're seeing now. I believe that the current lextent/blob/pextent usage of standard STL maps is both space and time inefficient -- in a place where it matters a lot. Sage has already discussed usage of something like flat_map from the boost library as a way to reduce the memory overhead, etc. I believe this is the right direction.
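
As an illustration of why a flat container helps, the sketch below keeps offset-to-length entries in one sorted std::vector and looks them up by binary search, which is the idea behind boost's flat_map. It is a standard-library stand-in written for this example, not boost code and not BlueStore's extent map; the FlatExtentMap name and its fields are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a flat map keyed by logical offset. All entries
// live in one contiguous array, so there is no per-node allocation or
// per-node pointer overhead as there is with std::map.
class FlatExtentMap {
 public:
  using Entry = std::pair<std::uint64_t, std::uint32_t>;  // offset -> length

  // Insert or overwrite. O(n) in the worst case because later entries shift.
  void set(std::uint64_t offset, std::uint32_t length) {
    auto it = std::lower_bound(entries_.begin(), entries_.end(), offset,
                               less_key);
    if (it != entries_.end() && it->first == offset) {
      it->second = length;
    } else {
      entries_.insert(it, Entry{offset, length});
    }
    assert(std::is_sorted(entries_.begin(), entries_.end()));
  }

  // Binary search over contiguous memory; returns nullptr if absent.
  const Entry* find(std::uint64_t offset) const {
    auto it = std::lower_bound(entries_.begin(), entries_.end(), offset,
                               less_key);
    return (it != entries_.end() && it->first == offset) ? &*it : nullptr;
  }

  std::size_t size() const { return entries_.size(); }

 private:
  static bool less_key(const Entry& e, std::uint64_t key) {
    return e.first < key;
  }
  std::vector<Entry> entries_;
};
```

The trade-off is the usual one for flat maps: lookups and iteration are cache-friendly and the footprint is small, but inserts in the middle have to move the tail of the array.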

Where are we on getting boost into our build? 

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, July 12, 2016 12:03 AM
> To: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: bluestore onode diet and encoding overhead
> 
> Hi All,
> 
> With Igor's patch last week I was able to get some bluestore performance
> runs in without segfaulting and started looking int the results.
> Somewhere along the line we really screwed up read performance, but
> that's another topic.  Right now I want to focus on random writes.
> Before we put the onode on a diet we were seeing massive amounts of read
> traffic in RocksDB during compaction that caused write stalls during 4K
> random writes.  Random write performance on fast hardware like NVMe
> devices was often below filestore at anything other than very large IO sizes.
> This was largely due to the size of the onode compounded with RocksDB's
> tendency toward read and write amplification.
> 
> The new test results look very promising.  We've dramatically improved
> performance of random writes at most IO sizes, so that they are now
> typically quite a bit higher than both filestore and older bluestore code.
> Unfortunately for very small IO sizes performance hasn't improved much.
> We are no longer seeing huge amounts of RocksDB read traffic and fewer
> write stalls.  We are however seeing huge memory usage (~9GB RSS per
> OSD) and very high CPU usage.  I think this confirms some of the memory
> issues somnath was continuing to see.  I don't think it's a leak exactly based
> on how the OSDs were behaving, but we need to run through massif still to
> be sure.
> 
> I ended up spending some time tonight with perf and digging through the
> encode code.  I wrote up some notes with graphs and code snippets and
> decided to put them up on the web.  Basically some of the encoding changes
> we implemented last month to reduce the onode size also appear to result in
> more buffer::list appends and the associated overhead.
> I've been trying to think through ways to improve the situation and thought
> other people might have some ideas too.  Here's a link to the short writeup:
> 
> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
> 
> Thanks,
> Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-12 15:14             ` Somnath Roy
@ 2016-07-12 15:31               ` Igor Fedotov
  2016-07-12 15:36                 ` Somnath Roy
  2016-07-12 15:37               ` Varada Kari
  2016-07-12 16:56               ` Sage Weil
  2 siblings, 1 reply; 39+ messages in thread
From: Igor Fedotov @ 2016-07-12 15:31 UTC (permalink / raw)
  To: Somnath Roy, Mark Nelson, ceph-devel

Somnath,

Yeah, you're right that the db partition is running out of space:

-876> 2016-07-12 18:19:26.133795 7f8e6ddb7700 10 bluefs get_usage bdev 0 
free 0 (0 B) / 268431360 (255 MB), used 100%
-875> 2016-07-12 18:19:26.133796 7f8e6ddb7700 10 bluefs get_usage bdev 1 
free 193986560 (185 MB) / 268427264 (255 MB), used 27%
-874> 2016-07-12 18:19:26.133797 7f8e6ddb7700 10 bluefs get_usage bdev 2 
free 1073741824 (1024 MB) / 1074782208 (1024 MB), used 0%

And I don't see much RAM consumption in this case.

But the curious thing about my test case is that it shouldn't increase
the amount of metadata written, as I'm doing writes within the first
megabyte only (see the fio script I posted last week).

It looks like something is wasting DB space - usage on bdev 0 grows
constantly while I'm running the test case...
Another observation: the issue doesn't reproduce with the stupid
allocator, hence I suspect a bug in the bitmap one...

Thanks,
Igor


On 12.07.2016 18:14, Somnath Roy wrote:
> Mark,
> Recently, the default allocator is changed to Bitmap and I saw it is returning < 0 return value only in the following case.
>
>    count = m_bit_alloc->alloc_blocks_res(nblks, &start_blk);
>    if (count == 0) {
>      return -ENOSPC;
>    }
>
> So, it seems it may not be the memory but db partition is getting out of space (?). I never faced it so far as I was running with 100GB of db partition may be.
> The amount of metadata write going on to the db even after onode diet is starting from ~1K and over time it is reaching > 4k or so (I checked for 4K RW). It is growing as extents are growing. So, 8 GB may not be enough.
> If this is true, next challenge is , how to automatically (or document) the size of rocksdb db partition based on the data partition size. For example, in the ZS case, we have calculated that we need ~9G db space per TB. We need to do similar calculation for rocksbd as well.
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, July 12, 2016 6:03 AM
> To: Igor Fedotov; Somnath Roy; ceph-devel
> Subject: Re: bluestore onode diet and encoding overhead
>
> In this case I'm assigning per OSD:
>
> 1G Data (basically the top level OSD dir) 1G WAL 8G DB 140G Block
>
> Mark
>
> On 07/12/2016 07:57 AM, Igor Fedotov wrote:
>> Mark,
>>
>> you can find my post named 'yet another assertion in bluestore during
>> random write' last week. It contains steps to reproduce in my case.
>>
>> Also I did some investigations (still incomplete though) with tuning
>> 'bluestore block db size' and 'bluestore block wal size'. Setting both
>> to 256M fixes the issue for me.
>>
>> But I'm still uncertain if that's a bug or just inappropriate settings...
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 12.07.2016 15:48, Mark Nelson wrote:
>>> Oh, that's good to know!  Have you tracked it down at all?  I noticed
>>> pretty extreme memory usage on the OSDs still, so that might be part
>>> of it.  I'm doing a massif run now.
>>>
>>> Mark
>>>
>>> On 07/12/2016 07:40 AM, Igor Fedotov wrote:
>>>> That's similar to what I have while running my test case with vstart...
>>>> Without Somnath's settings though..
>>>>
>>>>
>>>> On 12.07.2016 15:34, Mark Nelson wrote:
>>>>> Hi Somnath,
>>>>>
>>>>> I accidentally screwed up my first run with your settings but reran
>>>>> last night.  With your tuning the OSDs are failing to allocate to
>>>>> bdev0 after about 30 minutes of testing:
>>>>>
>>>>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed
>>>>> to allocate 1048576 on bdev 0, free 0; fallback to bdev 1
>>>>>
>>>>> They are able to continue running, but ultimately this leads to an
>>>>> assert later on.  I wonder if it's not compacting fast enough and
>>>>> ends up consuming the entire disk with stale metadata.
>>>>>
>>>>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
>>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In
>>>>> function 'int BlueFS::_allocate(unsigned int, uint64_t,
>>>>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12
>>>>> 04:31:02.627138
>>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398:
>>>>> FAILED
>>>>> assert(0 == "allocate failed... wtf")
>>>>>
>>>>>   ceph version v10.0.4-6936-gc7da2f7
>>>>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
>>>>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>> const*)+0x85) [0xd4cb75]
>>>>>   2: (BlueFS::_allocate(unsigned int, unsigned long,
>>>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x760)
>>>>> [0xb98220]
>>>>>   3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
>>>>>   4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
>>>>>   5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
>>>>>   6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
>>>>>   7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>>>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
>>>>> unsigned long, bool)+0x1456) [0xbfdb96]
>>>>>   8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>>>>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
>>>>>   9: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b)
>>>>> [0xb3df2b]
>>>>>   10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
>>>>>   11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
>>>>>   12: (()+0x7dc5) [0x7f0d185c4dc5]
>>>>>   13: (clone()+0x6d) [0x7f0d164bf28d]
>>>>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>> needed to interpret this.
>>>>>
>>>>>
>>>>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
>>>>>> Thanks Mark !
>>>>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did you
>>>>>> get chance to try out the rocksdb tuning I posted earlier ? It may
>>>>>> reduce the stalls in your environment.
>>>>>>
>>>>>> Regards
>>>>>> Somnath
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>> Sent: Tuesday, July 12, 2016 12:03 AM
>>>>>> To: ceph-devel
>>>>>> Subject: bluestore onode diet and encoding overhead
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> With Igor's patch last week I was able to get some bluestore
>>>>>> performance runs in without segfaulting and started looking int
>>>>>> the results.
>>>>>> Somewhere along the line we really screwed up read performance,
>>>>>> but that's another topic.  Right now I want to focus on random writes.
>>>>>> Before we put the onode on a diet we were seeing massive amounts
>>>>>> of read traffic in RocksDB during compaction that caused write
>>>>>> stalls during 4K random writes.  Random write performance on fast
>>>>>> hardware like NVMe devices was often below filestore at anything
>>>>>> other than very large IO sizes.  This was largely due to the size
>>>>>> of the onode compounded with RocksDB's tendency toward read and
>>>>>> write amplification.
>>>>>>
>>>>>> The new test results look very promising.  We've dramatically
>>>>>> improved performance of random writes at most IO sizes, so that
>>>>>> they are now typically quite a bit higher than both filestore and
>>>>>> older bluestore code.  Unfortunately for very small IO sizes
>>>>>> performance hasn't improved much.  We are no longer seeing huge
>>>>>> amounts of RocksDB read traffic and fewer write stalls.  We are
>>>>>> however seeing huge memory usage (~9GB RSS per OSD) and very high
>>>>>> CPU usage.  I think this confirms some of the memory issues
>>>>>> somnath was continuing to see.  I don't think it's a leak exactly
>>>>>> based on how the OSDs were behaving, but we need to run through massif still to be sure.
>>>>>>
>>>>>> I ended up spending some time tonight with perf and digging
>>>>>> through the encode code.  I wrote up some notes with graphs and
>>>>>> code snippets and decided to put them up on the web.  Basically
>>>>>> some of the encoding changes we implemented last month to reduce
>>>>>> the onode size also appear to result in more buffer::list appends
>>>>>> and the associated overhead.
>>>>>> I've been trying to think through ways to improve the situation
>>>>>> and thought other people might have some ideas too.  Here's a link
>>>>>> to the short writeup:
>>>>>>
>>>>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Mark
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>> message is intended only for the use of the designated
>>>>>> recipient(s) named above. If the reader of this message is not the
>>>>>> intended recipient, you are hereby notified that you have received
>>>>>> this message in error and that any review, dissemination,
>>>>>> distribution, or copying of this message is strictly prohibited.
>>>>>> If you have received this communication in error, please notify
>>>>>> the sender by telephone or e-mail (as shown above) immediately and
>>>>>> destroy any and all copies of this message in your possession
>>>>>> (whether hard copies or electronically stored copies).
>>>>>>
>>>>>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-12 15:31               ` Igor Fedotov
@ 2016-07-12 15:36                 ` Somnath Roy
  2016-07-12 15:46                   ` Mark Nelson
  0 siblings, 1 reply; 39+ messages in thread
From: Somnath Roy @ 2016-07-12 15:36 UTC (permalink / raw)
  To: Igor Fedotov, Mark Nelson, ceph-devel

<< And another observation - the issue isn't reproduced with stupid allocator hence I suspect some bug in bitmap one
I was about to correlate that; it seems to be a bug in the Bitmap allocator then.
I need to check whether the memory growth is also related to the Bitmap allocator. I will do some digging.

Thanks & Regards
Somnath


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-12 15:14             ` Somnath Roy
  2016-07-12 15:31               ` Igor Fedotov
@ 2016-07-12 15:37               ` Varada Kari
  2016-07-12 16:56               ` Sage Weil
  2 siblings, 0 replies; 39+ messages in thread
From: Varada Kari @ 2016-07-12 15:37 UTC (permalink / raw)
  To: Somnath Roy, Mark Nelson, Igor Fedotov, ceph-devel

In this case it seems we have enough free space for the allocation, but
it is fragmented, so we are not able to allocate it contiguously.
I can reproduce this problem in a unit test. Will check with Ramesh on that.
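
To illustrate the failure mode described above, here is a toy sketch
(a hypothetical model, not the BitMapAllocator code; Bitmap,
total_free, find_contiguous and checkerboard are made-up names). Half
of the blocks are free, yet no request for two adjacent blocks can be
satisfied:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy free-space bitmap: true means the block is free.
using Bitmap = std::vector<bool>;

// Total number of free blocks, regardless of where they are.
inline std::size_t total_free(const Bitmap& bm) {
  std::size_t n = 0;
  for (bool b : bm) n += b ? 1 : 0;
  return n;
}

// Index of the first run of `want` contiguous free blocks, or -1 if none.
inline long find_contiguous(const Bitmap& bm, std::size_t want) {
  assert(want > 0);
  std::size_t run = 0;
  for (std::size_t i = 0; i < bm.size(); ++i) {
    run = bm[i] ? run + 1 : 0;  // extend the current free run or reset it
    if (run == want) return static_cast<long>(i + 1 - want);
  }
  return -1;
}

// Worst-case fragmentation: every other block is free, so no two free
// blocks are adjacent.
inline Bitmap checkerboard(std::size_t nblocks) {
  Bitmap bm(nblocks);
  for (std::size_t i = 0; i < nblocks; ++i) bm[i] = (i % 2 == 0);
  return bm;
}
```

For checkerboard(16), total_free reports 8 free blocks, but
find_contiguous(..., 2) returns -1. An allocator that only hands out
contiguous ranges would report out-of-space here even though half the
device is free.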

Varada


On Tuesday 12 July 2016 08:45 PM, Somnath Roy wrote:
> Mark,
> Recently, the default allocator is changed to Bitmap and I saw it is returning < 0 return value only in the following case.
>
>   count = m_bit_alloc->alloc_blocks_res(nblks, &start_blk);
>   if (count == 0) {
>     return -ENOSPC;
>   }
>
> So, it seems it may not be the memory but db partition is getting out of space (?). I never faced it so far as I was running with 100GB of db partition may be.
> The amount of metadata write going on to the db even after onode diet is starting from ~1K and over time it is reaching > 4k or so (I checked for 4K RW). It is growing as extents are growing. So, 8 GB may not be enough.
> If this is true, next challenge is , how to automatically (or document) the size of rocksdb db partition based on the data partition size. For example, in the ZS case, we have calculated that we need ~9G db space per TB. We need to do similar calculation for rocksbd as well.
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, July 12, 2016 6:03 AM
> To: Igor Fedotov; Somnath Roy; ceph-devel
> Subject: Re: bluestore onode diet and encoding overhead
>
> In this case I'm assigning per OSD:
>
> 1G Data (basically the top level OSD dir) 1G WAL 8G DB 140G Block
>
> Mark
>
> On 07/12/2016 07:57 AM, Igor Fedotov wrote:
>> Mark,
>>
>> you can find my post named 'yet another assertion in bluestore during
>> random write' last week. It contains steps to reproduce in my case.
>>
>> Also I did some investigations (still incomplete though) with tuning
>> 'bluestore block db size' and 'bluestore block wal size'. Setting both
>> to 256M fixes the issue for me.
>>
>> But I'm still uncertain if that's a bug or just inappropriate settings...
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 12.07.2016 15:48, Mark Nelson wrote:
>>> Oh, that's good to know!  Have you tracked it down at all?  I noticed
>>> pretty extreme memory usage on the OSDs still, so that might be part
>>> of it.  I'm doing a massif run now.
>>>
>>> Mark
>>>
>>> On 07/12/2016 07:40 AM, Igor Fedotov wrote:
>>>> That's similar to what I have while running my test case with vstart...
>>>> Without Somnath's settings though..
>>>>
>>>>
>>>> On 12.07.2016 15:34, Mark Nelson wrote:
>>>>> Hi Somnath,
>>>>>
>>>>> I accidentally screwed up my first run with your settings but reran
>>>>> last night.  With your tuning the OSDs are failing to allocate to
>>>>> bdev0 after about 30 minutes of testing:
>>>>>
>>>>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed
>>>>> to allocate 1048576 on bdev 0, free 0; fallback to bdev 1
>>>>>
>>>>> They are able to continue running, but ultimately this leads to an
>>>>> assert later on.  I wonder if it's not compacting fast enough and
>>>>> ends up consuming the entire disk with stale metadata.
>>>>>
>>>>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
>>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In
>>>>> function 'int BlueFS::_allocate(unsigned int, uint64_t,
>>>>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12
>>>>> 04:31:02.627138
>>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398:
>>>>> FAILED
>>>>> assert(0 == "allocate failed... wtf")
>>>>>
>>>>>  ceph version v10.0.4-6936-gc7da2f7
>>>>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
>>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>> const*)+0x85) [0xd4cb75]
>>>>>  2: (BlueFS::_allocate(unsigned int, unsigned long,
>>>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x760)
>>>>> [0xb98220]
>>>>>  3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
>>>>>  4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
>>>>>  5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
>>>>>  6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
>>>>>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>>>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
>>>>> unsigned long, bool)+0x1456) [0xbfdb96]
>>>>>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>>>>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
>>>>>  9: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b)
>>>>> [0xb3df2b]
>>>>>  10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
>>>>>  11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
>>>>>  12: (()+0x7dc5) [0x7f0d185c4dc5]
>>>>>  13: (clone()+0x6d) [0x7f0d164bf28d]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>> needed to interpret this.
>>>>>
>>>>>
>>>>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
>>>>>> Thanks Mark !
>>>>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did you
>>>>>> get chance to try out the rocksdb tuning I posted earlier ? It may
>>>>>> reduce the stalls in your environment.
>>>>>>
>>>>>> Regards
>>>>>> Somnath
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>> Sent: Tuesday, July 12, 2016 12:03 AM
>>>>>> To: ceph-devel
>>>>>> Subject: bluestore onode diet and encoding overhead
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> With Igor's patch last week I was able to get some bluestore
>>>>>> performance runs in without segfaulting and started looking int
>>>>>> the results.
>>>>>> Somewhere along the line we really screwed up read performance,
>>>>>> but that's another topic.  Right now I want to focus on random writes.
>>>>>> Before we put the onode on a diet we were seeing massive amounts
>>>>>> of read traffic in RocksDB during compaction that caused write
>>>>>> stalls during 4K random writes.  Random write performance on fast
>>>>>> hardware like NVMe devices was often below filestore at anything
>>>>>> other than very large IO sizes.  This was largely due to the size
>>>>>> of the onode compounded with RocksDB's tendency toward read and
>>>>>> write amplification.
>>>>>>
>>>>>> The new test results look very promising.  We've dramatically
>>>>>> improved performance of random writes at most IO sizes, so that
>>>>>> they are now typically quite a bit higher than both filestore and
>>>>>> older bluestore code.  Unfortunately for very small IO sizes
>>>>>> performance hasn't improved much.  We are no longer seeing huge
>>>>>> amounts of RocksDB read traffic and fewer write stalls.  We are
>>>>>> however seeing huge memory usage (~9GB RSS per OSD) and very high
>>>>>> CPU usage.  I think this confirms some of the memory issues
>>>>>> somnath was continuing to see.  I don't think it's a leak exactly
>>>>>> based on how the OSDs were behaving, but we need to run through massif still to be sure.
>>>>>>
>>>>>> I ended up spending some time tonight with perf and digging
>>>>>> through the encode code.  I wrote up some notes with graphs and
>>>>>> code snippets and decided to put them up on the web.  Basically
>>>>>> some of the encoding changes we implemented last month to reduce
>>>>>> the onode size also appear to result in more buffer::list appends
>>>>>> and the associated overhead.
>>>>>> I've been trying to think through ways to improve the situation
>>>>>> and thought other people might have some ideas too.  Here's a link
>>>>>> to the short writeup:
>>>>>>
>>>>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Mark


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-12 15:20 ` Allen Samuels
@ 2016-07-12 15:37   ` Mark Nelson
  2016-07-12 21:15     ` Allen Samuels
  2016-07-13  1:50   ` Sage Weil
  1 sibling, 1 reply; 39+ messages in thread
From: Mark Nelson @ 2016-07-12 15:37 UTC (permalink / raw)
  To: Allen Samuels, ceph-devel



On 07/12/2016 10:20 AM, Allen Samuels wrote:
> Good analysis.
>
> My original comments about putting the oNode on a diet included the idea of a "custom" encode/decode path for certain high-usage cases. At the time, Sage resisted going down that path hoping that a more optimized generic case would get the job done. Your analysis shows that while we've achieved significant space reduction this has come at the expense of CPU time -- which dominates small object performance (I suspect that eventually we'd discover that the variable length decode path would be responsible for a substantial read performance degradation also -- which may or may not be part of the read performance drop-off that you're seeing). This isn't a surprising result, though it is unfortunate.
>
> I believe we need to revisit the idea of custom encode/decode paths for high-usage cases, only now the gains need to be focused on CPU utilization as well as space efficiency.

I'm not against it, but it might be worth at least a quick attempt at
preallocating the append_buffer and/or trying Piotr's idea of doing a
direct memcpy without the append at all.  That may help quite a bit
(though perhaps it's not enough in the long run).
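
To make the two ideas concrete, here is a rough sketch of the difference between appending field by field and sizing the buffer once and copying into it.  It uses std::string as a stand-in for the output buffer and is not Ceph's buffer::list code; the Extent struct and both function names are made up for illustration, and fields are written in host byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a small fixed-size record; not a Ceph type.
struct Extent {
  uint64_t offset;
  uint32_t length;
};

// Variant 1: one append per field. Every call re-checks capacity and
// may reallocate, which is the per-append overhead under discussion.
std::string encode_appending(const std::vector<Extent>& v) {
  std::string out;
  for (const Extent& e : v) {
    out.append(reinterpret_cast<const char*>(&e.offset), sizeof(e.offset));
    out.append(reinterpret_cast<const char*>(&e.length), sizeof(e.length));
  }
  return out;
}

// Variant 2: size the buffer once up front, then memcpy each field
// directly into place with no per-field bookkeeping.
std::string encode_preallocated(const std::vector<Extent>& v) {
  const size_t per_record = sizeof(uint64_t) + sizeof(uint32_t);
  std::string out(v.size() * per_record, '\0');
  char* p = out.empty() ? nullptr : &out[0];
  for (const Extent& e : v) {
    std::memcpy(p, &e.offset, sizeof(e.offset));
    p += sizeof(e.offset);
    std::memcpy(p, &e.length, sizeof(e.length));
    p += sizeof(e.length);
  }
  return out;
}
```

Both variants produce the same bytes; the second only works when the encoded size can be computed (or bounded) before writing, which is the hard part for variable-length encodings.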

A couple of other thoughts:

I still think SIMD encode approaches are interesting if we can lay data
out in memory in a friendly way (though this feels like it might be
painful):

http://arxiv.org/abs/1209.2137

But on the other hand, Kenton Varda, who was previously a primary author
of Google's Protocol Buffers, ended up doing something a little different
from varint:

https://capnproto.org/encoding.html

Look specifically at the packing section.  It looks somewhat attractive 
to me.
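
For comparison, here is a small sketch of the two styles.  The varint functions follow the usual 7-bits-per-byte scheme; pack_word is only a simplified take on the tag-byte idea from the packing section (the real format also has special cases for runs of all-zero and all-nonzero words, which are left out).  All function names are mine, not from either library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Varint: 7 payload bits per byte, high bit set on every byte but the last.
void encode_varint(uint64_t v, std::vector<uint8_t>& out) {
  while (v >= 0x80) {
    out.push_back(static_cast<uint8_t>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

// Decodes one varint starting at pos and advances pos past it.
// Assumes well-formed input; at() throws if the buffer ends early.
uint64_t decode_varint(const std::vector<uint8_t>& in, size_t& pos) {
  uint64_t v = 0;
  int shift = 0;
  while (true) {
    assert(shift < 64);
    const uint8_t b = in.at(pos++);
    v |= static_cast<uint64_t>(b & 0x7f) << shift;
    if (!(b & 0x80)) break;
    shift += 7;
  }
  return v;
}

// Simplified word packing: one tag byte whose bit i is set when byte i
// (little-endian order) of the 64-bit word is nonzero, followed by only
// the nonzero bytes. No data-dependent loop exit per byte as in varint.
void pack_word(uint64_t w, std::vector<uint8_t>& out) {
  const size_t tag_pos = out.size();
  out.push_back(0);  // placeholder for the tag
  uint8_t tag = 0;
  for (int i = 0; i < 8; ++i) {
    const uint8_t b = static_cast<uint8_t>(w >> (8 * i));
    if (b != 0) {
      tag |= static_cast<uint8_t>(1u << i);
      out.push_back(b);
    }
  }
  out[tag_pos] = tag;
}
```

The appeal of the packed form is that decode works a whole word at a time from a single tag byte rather than testing a continuation bit on every byte.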

Mark

>
> I believe this activity can also address some of the memory consumption issues that we're seeing now. I believe that the current lextent/blob/pextent usage of standard STL maps is both space and time inefficient -- in a place where it matters a lot. Sage has already discussed usage of something like flat_map from the boost library as a way to reduce the memory overhead, etc. I believe this is the right direction.
>
> Where are we on getting boost into our build?
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Tuesday, July 12, 2016 12:03 AM
>> To: ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: bluestore onode diet and encoding overhead
>>
>> Hi All,
>>
>> With Igor's patch last week I was able to get some bluestore performance
>> runs in without segfaulting and started looking int the results.
>> Somewhere along the line we really screwed up read performance, but
>> that's another topic.  Right now I want to focus on random writes.
>> Before we put the onode on a diet we were seeing massive amounts of read
>> traffic in RocksDB during compaction that caused write stalls during 4K
>> random writes.  Random write performance on fast hardware like NVMe
>> devices was often below filestore at anything other than very large IO sizes.
>> This was largely due to the size of the onode compounded with RocksDB's
>> tendency toward read and write amplification.
>>
>> The new test results look very promising.  We've dramatically improved
>> performance of random writes at most IO sizes, so that they are now
>> typically quite a bit higher than both filestore and older bluestore code.
>> Unfortunately for very small IO sizes performance hasn't improved much.
>> We are no longer seeing huge amounts of RocksDB read traffic and fewer
>> write stalls.  We are however seeing huge memory usage (~9GB RSS per
>> OSD) and very high CPU usage.  I think this confirms some of the memory
>> issues somnath was continuing to see.  I don't think it's a leak exactly based
>> on how the OSDs were behaving, but we need to run through massif still to
>> be sure.
>>
>> I ended up spending some time tonight with perf and digging through the
>> encode code.  I wrote up some notes with graphs and code snippets and
>> decided to put them up on the web.  Basically some of the encoding changes
>> we implemented last month to reduce the onode size also appear to result in
>> more buffer::list appends and the associated overhead.
>> I've been trying to think through ways to improve the situation and thought
>> other people might have some ideas too.  Here's a link to the short writeup:
>>
>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
>>
>> Thanks,
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> body of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-12 15:36                 ` Somnath Roy
@ 2016-07-12 15:46                   ` Mark Nelson
  2016-07-12 20:48                     ` Mark Nelson
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Nelson @ 2016-07-12 15:46 UTC (permalink / raw)
  To: Somnath Roy, Igor Fedotov, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 11431 bytes --]

I'm seeing the majority of memory growth happening during random reads 
still.  After looking through the massif output, it looks like it may be 
associated with the bufferptr creation in KernelDevice::read here:

https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L477

On 07/12/2016 10:36 AM, Somnath Roy wrote:
> << And another observation - the issue isn't reproduced with stupid allocator hence I suspect some bug in bitmap one
> I was about to correlate that; it seems to be a bug in the Bitmap allocator then.
> I need to check whether the memory growth is also related to the Bitmap allocator. I will do some digging.
>
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Tuesday, July 12, 2016 8:32 AM
> To: Somnath Roy; Mark Nelson; ceph-devel
> Subject: Re: bluestore onode diet and encoding overhead
>
> Somnath,
>
> yeah, you're right that the db partition is running out of space:
>
> -876> 2016-07-12 18:19:26.133795 7f8e6ddb7700 10 bluefs get_usage bdev 0
> free 0 (0 B) / 268431360 (255 MB), used 100%
> -875> 2016-07-12 18:19:26.133796 7f8e6ddb7700 10 bluefs get_usage bdev 1
> free 193986560 (185 MB) / 268427264 (255 MB), used 27%
> -874> 2016-07-12 18:19:26.133797 7f8e6ddb7700 10 bluefs get_usage bdev 2
> free 1073741824 (1024 MB) / 1074782208 (1024 MB), used 0%
>
> And I don't see much RAM consumption in this case.
>
> But the curious thing about my test case is that it shouldn't increase
> amount of metadata written as I'm doing writes within the first megabyte
> only( see fio script I posted last week).
>
> Looks like somebody wastes DB space - usage at bdev 0 is constantly
> growing while I'm running the test case...
> And another observation - the issue isn't reproduced with stupid
> allocator hence I suspect some bug in bitmap one...
>
> Thanks,
> Igor
>
>
> On 12.07.2016 18:14, Somnath Roy wrote:
>> Mark,
>> Recently, the default allocator is changed to Bitmap and I saw it is returning < 0 return value only in the following case.
>>
>>    count = m_bit_alloc->alloc_blocks_res(nblks, &start_blk);
>>    if (count == 0) {
>>      return -ENOSPC;
>>    }
>>
>> So, it seems it may not be the memory but db partition is getting out of space (?). I never faced it so far as I was running with 100GB of db partition may be.
>> The amount of metadata write going on to the db even after onode diet is starting from ~1K and over time it is reaching > 4k or so (I checked for 4K RW). It is growing as extents are growing. So, 8 GB may not be enough.
>> If this is true, the next challenge is how to automatically size (or document the sizing of) the rocksdb db partition based on the data partition size. For example, in the ZS case, we have calculated that we need ~9G db space per TB. We need to do a similar calculation for rocksdb as well.
>>
>> Thanks & Regards
>> Somnath
>>
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Tuesday, July 12, 2016 6:03 AM
>> To: Igor Fedotov; Somnath Roy; ceph-devel
>> Subject: Re: bluestore onode diet and encoding overhead
>>
>> In this case I'm assigning per OSD:
>>
>> 1G Data (basically the top level OSD dir)
>> 1G WAL
>> 8G DB
>> 140G Block
>>
>> Mark
>>
>> On 07/12/2016 07:57 AM, Igor Fedotov wrote:
>>> Mark,
>>>
>>> you can find my post named 'yet another assertion in bluestore during
>>> random write' last week. It contains steps to reproduce in my case.
>>>
>>> Also I did some investigations (still incomplete though) with tuning
>>> 'bluestore block db size' and 'bluestore block wal size'. Setting both
>>> to 256M fixes the issue for me.
>>>
>>> But I'm still uncertain if that's a bug or just inappropriate settings...
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 12.07.2016 15:48, Mark Nelson wrote:
>>>> Oh, that's good to know!  Have you tracked it down at all?  I noticed
>>>> pretty extreme memory usage on the OSDs still, so that might be part
>>>> of it.  I'm doing a massif run now.
>>>>
>>>> Mark
>>>>
>>>> On 07/12/2016 07:40 AM, Igor Fedotov wrote:
>>>>> That's similar to what I have while running my test case with vstart...
>>>>> Without Somnath's settings though..
>>>>>
>>>>>
>>>>> On 12.07.2016 15:34, Mark Nelson wrote:
>>>>>> Hi Somnath,
>>>>>>
>>>>>> I accidentally screwed up my first run with your settings but reran
>>>>>> last night.  With your tuning the OSDs are failing to allocate to
>>>>>> bdev0 after about 30 minutes of testing:
>>>>>>
>>>>>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed
>>>>>> to allocate 1048576 on bdev 0, free 0; fallback to bdev 1
>>>>>>
>>>>>> They are able to continue running, but ultimately this leads to an
>>>>>> assert later on.  I wonder if it's not compacting fast enough and
>>>>>> ends up consuming the entire disk with stale metadata.
>>>>>>
>>>>>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
>>>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In
>>>>>> function 'int BlueFS::_allocate(unsigned int, uint64_t,
>>>>>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12
>>>>>> 04:31:02.627138
>>>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398:
>>>>>> FAILED
>>>>>> assert(0 == "allocate failed... wtf")
>>>>>>
>>>>>>   ceph version v10.0.4-6936-gc7da2f7
>>>>>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
>>>>>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>>> const*)+0x85) [0xd4cb75]
>>>>>>   2: (BlueFS::_allocate(unsigned int, unsigned long,
>>>>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t>
>>>>>> >*)+0x760) [0xb98220]
>>>>>>   3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
>>>>>>   4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
>>>>>>   5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
>>>>>>   6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
>>>>>>   7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>>>>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
>>>>>> unsigned long, bool)+0x1456) [0xbfdb96]
>>>>>>   8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>>>>>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
>>>>>>   9:
>>>>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::
>>>>>> TransactionImpl>)+0x6b)
>>>>>>
>>>>>> [0xb3df2b]
>>>>>>   10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
>>>>>>   11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
>>>>>>   12: (()+0x7dc5) [0x7f0d185c4dc5]
>>>>>>   13: (clone()+0x6d) [0x7f0d164bf28d]
>>>>>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>>> needed to interpret this.
>>>>>>
>>>>>>
>>>>>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
>>>>>>> Thanks Mark !
>>>>>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did you
>>>>>>> get chance to try out the rocksdb tuning I posted earlier ? It may
>>>>>>> reduce the stalls in your environment.
>>>>>>>
>>>>>>> Regards
>>>>>>> Somnath
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>>> Sent: Tuesday, July 12, 2016 12:03 AM
>>>>>>> To: ceph-devel
>>>>>>> Subject: bluestore onode diet and encoding overhead
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> With Igor's patch last week I was able to get some bluestore
>>>>>>> performance runs in without segfaulting and started looking int
>>>>>>> the results.
>>>>>>> Somewhere along the line we really screwed up read performance,
>>>>>>> but that's another topic.  Right now I want to focus on random writes.
>>>>>>> Before we put the onode on a diet we were seeing massive amounts
>>>>>>> of read traffic in RocksDB during compaction that caused write
>>>>>>> stalls during 4K random writes.  Random write performance on fast
>>>>>>> hardware like NVMe devices was often below filestore at anything
>>>>>>> other than very large IO sizes.  This was largely due to the size
>>>>>>> of the onode compounded with RocksDB's tendency toward read and
>>>>>>> write amplification.
>>>>>>>
>>>>>>> The new test results look very promising.  We've dramatically
>>>>>>> improved performance of random writes at most IO sizes, so that
>>>>>>> they are now typically quite a bit higher than both filestore and
>>>>>>> older bluestore code.  Unfortunately for very small IO sizes
>>>>>>> performance hasn't improved much.  We are no longer seeing huge
>>>>>>> amounts of RocksDB read traffic and fewer write stalls.  We are
>>>>>>> however seeing huge memory usage (~9GB RSS per OSD) and very high
>>>>>>> CPU usage.  I think this confirms some of the memory issues
>>>>>>> somnath was continuing to see.  I don't think it's a leak exactly
>>>>>>> based on how the OSDs were behaving, but we need to run through massif still to be sure.
>>>>>>>
>>>>>>> I ended up spending some time tonight with perf and digging
>>>>>>> through the encode code.  I wrote up some notes with graphs and
>>>>>>> code snippets and decided to put them up on the web.  Basically
>>>>>>> some of the encoding changes we implemented last month to reduce
>>>>>>> the onode size also appear to result in more buffer::list appends
>>>>>>> and the associated overhead.
>>>>>>> I've been trying to think through ways to improve the situation
>>>>>>> and thought other people might have some ideas too.  Here's a link
>>>>>>> to the short writeup:
>>>>>>>
>>>>>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Mark
>>>>>>>
>>>>>>>
>

[-- Attachment #2: osd.0.massif.out --]
[-- Type: chemical/x-gulp, Size: 81728 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-12 15:14             ` Somnath Roy
  2016-07-12 15:31               ` Igor Fedotov
  2016-07-12 15:37               ` Varada Kari
@ 2016-07-12 16:56               ` Sage Weil
  2016-07-12 16:57                 ` Sage Weil
  2016-07-12 17:50                 ` Allen Samuels
  2 siblings, 2 replies; 39+ messages in thread
From: Sage Weil @ 2016-07-12 16:56 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Mark Nelson, Igor Fedotov, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 10031 bytes --]

On Tue, 12 Jul 2016, Somnath Roy wrote:
> Mark,
> Recently, the default allocator is changed to Bitmap and I saw it is 
> returning < 0 return value only in the following case.
> 
>   count = m_bit_alloc->alloc_blocks_res(nblks, &start_blk);
>   if (count == 0) {
>     return -ENOSPC;
>   }
> 
> So, it seems it may not be the memory but db partition is getting out of 
> space (?). I never faced it so far as I was running with 100GB of db 
> partition may be. The amount of metadata write going on to the db even 
> after onode diet is starting from ~1K and over time it is reaching > 4k 
> or so (I checked for 4K RW). It is growing as extents are growing. So, 8 
> GB may not be enough. If this is true, next challenge is , how to 
> automatically (or document) the size of rocksdb db partition based on 
> the data partition size. For example, in the ZS case, we have calculated 
> that we need ~9G db space per TB. We need to do similar calculation for 
> rocksdb as well.

We can precalculate or otherwise pre-size the db partition because we 
don't know what kind of data the user is going to store, and that data 
might even be 100% omap.  This is why BlueStore and BlueFS balance their 
free space--so that the bluefs/db usage can grow and shrink dynamically as 
needed.

We'll need to implement something similar for ZS.
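
As a very rough illustration of what "balancing free space" means here, the sketch below decides how much free space to shift between the data side and the db side so that the db's share of free space stays inside a band.  This is not BlueStore's actual balancing code; the struct, the function name, and the percentage parameters are all hypothetical, and it assumes byte counts small enough that multiplying by 100 does not overflow 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of how free space is split; not BlueStore's structures.
struct SpaceState {
  uint64_t db_free;    // free bytes currently owned by the db/metadata side
  uint64_t data_free;  // free bytes currently owned by the data side
};

// Returns a signed byte count: positive means move that much free space
// from data to db, negative means hand that much back to data, zero means
// the db's share of free space is already within [min_pct, max_pct].
int64_t rebalance_delta(const SpaceState& s, unsigned min_pct, unsigned max_pct) {
  const uint64_t total_free = s.db_free + s.data_free;
  if (total_free == 0) {
    return 0;  // nothing left to move in either direction
  }
  if (s.db_free * 100 < total_free * min_pct) {
    // db side is starved: top it up to the lower bound.
    const uint64_t target = total_free * min_pct / 100;
    return static_cast<int64_t>(target - s.db_free);
  }
  if (s.db_free * 100 > total_free * max_pct) {
    // db side is hoarding free space: shrink it to the upper bound.
    const uint64_t target = total_free * max_pct / 100;
    return -static_cast<int64_t>(s.db_free - target);
  }
  return 0;
}
```

With a 2%..90% band, a db side with zero free bytes out of 1000 total free would be given 20, which is the kind of adjustment a fixed-size partition cannot make.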

sage


>
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, July 12, 2016 6:03 AM
> To: Igor Fedotov; Somnath Roy; ceph-devel
> Subject: Re: bluestore onode diet and encoding overhead
> 
> In this case I'm assigning per OSD:
> 
> 1G Data (basically the top level OSD dir)
> 1G WAL
> 8G DB
> 140G Block
> 
> Mark
> 
> On 07/12/2016 07:57 AM, Igor Fedotov wrote:
> > Mark,
> >
> > you can find my post named 'yet another assertion in bluestore during
> > random write' last week. It contains steps to reproduce in my case.
> >
> > Also I did some investigations (still incomplete though) with tuning
> > 'bluestore block db size' and 'bluestore block wal size'. Setting both
> > to 256M fixes the issue for me.
> >
> > But I'm still uncertain if that's a bug or just inappropriate settings...
> >
> >
> > Thanks,
> >
> > Igor
> >
> >
> > On 12.07.2016 15:48, Mark Nelson wrote:
> >> Oh, that's good to know!  Have you tracked it down at all?  I noticed
> >> pretty extreme memory usage on the OSDs still, so that might be part
> >> of it.  I'm doing a massif run now.
> >>
> >> Mark
> >>
> >> On 07/12/2016 07:40 AM, Igor Fedotov wrote:
> >>> That's similar to what I have while running my test case with vstart...
> >>> Without Somnath's settings though..
> >>>
> >>>
> >>> On 12.07.2016 15:34, Mark Nelson wrote:
> >>>> Hi Somnath,
> >>>>
> >>>> I accidentally screwed up my first run with your settings but reran
> >>>> last night.  With your tuning the OSDs are failing to allocate to
> >>>> bdev0 after about 30 minutes of testing:
> >>>>
> >>>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed
> >>>> to allocate 1048576 on bdev 0, free 0; fallback to bdev 1
> >>>>
> >>>> They are able to continue running, but ultimately this leads to an
> >>>> assert later on.  I wonder if it's not compacting fast enough and
> >>>> ends up consuming the entire disk with stale metadata.
> >>>>
> >>>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
> >>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In
> >>>> function 'int BlueFS::_allocate(unsigned int, uint64_t,
> >>>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12
> >>>> 04:31:02.627138
> >>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398:
> >>>> FAILED
> >>>> assert(0 == "allocate failed... wtf")
> >>>>
> >>>>  ceph version v10.0.4-6936-gc7da2f7
> >>>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
> >>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >>>> const*)+0x85) [0xd4cb75]
> >>>>  2: (BlueFS::_allocate(unsigned int, unsigned long,
> >>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t>
> >>>> >*)+0x760) [0xb98220]
> >>>>  3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
> >>>>  4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
> >>>>  5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
> >>>>  6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
> >>>>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
> >>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
> >>>> unsigned long, bool)+0x1456) [0xbfdb96]
> >>>>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
> >>>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
> >>>>  9:
> >>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::
> >>>> TransactionImpl>)+0x6b)
> >>>>
> >>>> [0xb3df2b]
> >>>>  10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
> >>>>  11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
> >>>>  12: (()+0x7dc5) [0x7f0d185c4dc5]
> >>>>  13: (clone()+0x6d) [0x7f0d164bf28d]
> >>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >>>> needed to interpret this.
> >>>>
> >>>>
> >>>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
> >>>>> Thanks Mark !
> >>>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did you
> >>>>> get chance to try out the rocksdb tuning I posted earlier ? It may
> >>>>> reduce the stalls in your environment.
> >>>>>
> >>>>> Regards
> >>>>> Somnath
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: ceph-devel-owner@vger.kernel.org
> >>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> >>>>> Sent: Tuesday, July 12, 2016 12:03 AM
> >>>>> To: ceph-devel
> >>>>> Subject: bluestore onode diet and encoding overhead
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> With Igor's patch last week I was able to get some bluestore
> >>>>> performance runs in without segfaulting and started looking int
> >>>>> the results.
> >>>>> Somewhere along the line we really screwed up read performance,
> >>>>> but that's another topic.  Right now I want to focus on random writes.
> >>>>> Before we put the onode on a diet we were seeing massive amounts
> >>>>> of read traffic in RocksDB during compaction that caused write
> >>>>> stalls during 4K random writes.  Random write performance on fast
> >>>>> hardware like NVMe devices was often below filestore at anything
> >>>>> other than very large IO sizes.  This was largely due to the size
> >>>>> of the onode compounded with RocksDB's tendency toward read and
> >>>>> write amplification.
> >>>>>
> >>>>> The new test results look very promising.  We've dramatically
> >>>>> improved performance of random writes at most IO sizes, so that
> >>>>> they are now typically quite a bit higher than both filestore and
> >>>>> older bluestore code.  Unfortunately for very small IO sizes
> >>>>> performance hasn't improved much.  We are no longer seeing huge
> >>>>> amounts of RocksDB read traffic and fewer write stalls.  We are
> >>>>> however seeing huge memory usage (~9GB RSS per OSD) and very high
> >>>>> CPU usage.  I think this confirms some of the memory issues
> >>>>> somnath was continuing to see.  I don't think it's a leak exactly
> >>>>> based on how the OSDs were behaving, but we need to run through massif still to be sure.
> >>>>>
> >>>>> I ended up spending some time tonight with perf and digging
> >>>>> through the encode code.  I wrote up some notes with graphs and
> >>>>> code snippets and decided to put them up on the web.  Basically
> >>>>> some of the encoding changes we implemented last month to reduce
> >>>>> the onode size also appear to result in more buffer::list appends
> >>>>> and the associated overhead.
> >>>>> I've been trying to think through ways to improve the situation
> >>>>> and thought other people might have some ideas too.  Here's a link
> >>>>> to the short writeup:
> >>>>>
> >>>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Mark
> >>>>>
> >>>>>
> >>>>>
> >>>
> >

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-12 16:56               ` Sage Weil
@ 2016-07-12 16:57                 ` Sage Weil
  2016-07-12 17:06                   ` Somnath Roy
  2016-07-12 17:50                 ` Allen Samuels
  1 sibling, 1 reply; 39+ messages in thread
From: Sage Weil @ 2016-07-12 16:57 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Mark Nelson, Igor Fedotov, ceph-devel

On Tue, 12 Jul 2016, Sage Weil wrote:
> On Tue, 12 Jul 2016, Somnath Roy wrote:
> > Mark,
> > Recently, the default allocator is changed to Bitmap and I saw it is 
> > returning < 0 return value only in the following case.
> > 
> >   count = m_bit_alloc->alloc_blocks_res(nblks, &start_blk);
> >   if (count == 0) {
> >     return -ENOSPC;
> >   }
> > 
> > So, it seems it may not be the memory but db partition is getting out of 
> > space (?). I never faced it so far as I was running with 100GB of db 
> > partition may be. The amount of metadata write going on to the db even 
> > after onode diet is starting from ~1K and over time it is reaching > 4k 
> > or so (I checked for 4K RW). It is growing as extents are growing. So, 8 
> > GB may not be enough. If this is true, next challenge is , how to 
> > automatically (or document) the size of rocksdb db partition based on 
> > the data partition size. For example, in the ZS case, we have calculated 
> > that we need ~9G db space per TB. We need to do similar calculation for 
> > rocksdb as well.
> 
> We can precalculate or otherwise pre-size the db partition because we 
     ^
     can't

> don't know what kind of data the user is going to store, and that data 
> might even be 100% omap.  This is why BlueStore and BlueFS balance their 
> free space--so that the bluefs/db usage can grow and shrink dynamically as 
> needed.
> 
> We'll need to implement something similar for ZS.
> 
> sage

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-12 16:57                 ` Sage Weil
@ 2016-07-12 17:06                   ` Somnath Roy
  0 siblings, 0 replies; 39+ messages in thread
From: Somnath Roy @ 2016-07-12 17:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, Igor Fedotov, ceph-devel

Yeah, agreed. I forgot that a user can write any amount of omap data.
We will discuss internally how we can handle that with ZS.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net]
Sent: Tuesday, July 12, 2016 9:57 AM
To: Somnath Roy
Cc: Mark Nelson; Igor Fedotov; ceph-devel
Subject: RE: bluestore onode diet and encoding overhead

On Tue, 12 Jul 2016, Sage Weil wrote:
> On Tue, 12 Jul 2016, Somnath Roy wrote:
> > Mark,
> > Recently, the default allocator is changed to Bitmap and I saw it is
> > returning < 0 return value only in the following case.
> >
> >   count = m_bit_alloc->alloc_blocks_res(nblks, &start_blk);
> >   if (count == 0) {
> >     return -ENOSPC;
> >   }
> >
> > So, it seems it may not be the memory but db partition is getting
> > out of space (?). I never faced it so far as I was running with
> > 100GB of db partition may be. The amount of metadata write going on
> > to the db even after onode diet is starting from ~1K and over time
> > it is reaching > 4k or so (I checked for 4K RW). It is growing as
> > extents are growing. So, 8 GB may not be enough. If this is true,
> > next challenge is , how to automatically (or document) the size of
> > rocksdb db partition based on the data partition size. For example,
> > in the ZS case, we have calculated that we need ~9G db space per TB.
> > We need to do similar calculation for rocksdb as well.
>
> We can precalculate or otherwise pre-size the db partition because we
     ^
     can't

> don't know what kind of data the user is going to store, and that data
> might even be 100% omap.  This is why BlueStore and BlueFS balance
> their free space--so that the bluefs/db usage can grow and shrink
> dynamically as needed.
>
> We'll need to implement something similar for ZS.
>
> sage

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-12 16:56               ` Sage Weil
  2016-07-12 16:57                 ` Sage Weil
@ 2016-07-12 17:50                 ` Allen Samuels
  1 sibling, 0 replies; 39+ messages in thread
From: Allen Samuels @ 2016-07-12 17:50 UTC (permalink / raw)
  To: Sage Weil, Somnath Roy; +Cc: Mark Nelson, Igor Fedotov, ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, July 12, 2016 9:57 AM
> To: Somnath Roy <Somnath.Roy@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: bluestore onode diet and encoding overhead
> 
> On Tue, 12 Jul 2016, Somnath Roy wrote:
> > Mark,
> > Recently, the default allocator is changed to Bitmap and I saw it is
> > returning < 0 return value only in the following case.
> >
> >   count = m_bit_alloc->alloc_blocks_res(nblks, &start_blk);
> >   if (count == 0) {
> >     return -ENOSPC;
> >   }
> >
> > So, it seems it may not be the memory but db partition is getting out
> > of space (?). I never faced it so far as I was running with 100GB of
> > db partition may be. The amount of metadata write going on to the db
> > even after onode diet is starting from ~1K and over time it is
> > reaching > 4k or so (I checked for 4K RW). It is growing as extents
> > are growing. So, 8 GB may not be enough. If this is true, the next
> > challenge is how to automatically size (or document) the rocksdb
> > db partition based on the data partition size. For example, in the ZS
> > case, we have calculated that we need ~9G db space per TB. We need to
> > do a similar calculation for rocksdb as well.
> 
> We can't precalculate or otherwise pre-size the db partition because we don't
> know what kind of data the user is going to store, and that data might even
> be 100% omap.  This is why BlueStore and BlueFS balance their free space--so
> that the bluefs/db usage can grow and shrink dynamically as needed.
> 
> We'll need to implement something similar for ZS.

Yes, ZS needs some work to properly support dynamic adjustment of the amount of metadata under management. Sharing the media is one part of that problem; the larger part is a set of other internal issues that will need to be fixed. IMO having a fixed partition size for ZS metadata is something that could be tolerated for a while. My primary concern here is whether the future, dynamically variable code will be backward compatible.

I think we need to move to a situation where ZS sits on top of BlueFS, rather than on a raw device. With today's code, you'll have to statically size the ZS database (which will result in a fixed allocation in BlueFS). In the future, variable sizing (again through BlueFS) can be done.
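
For the static case, the sizing arithmetic can be sketched in a few lines. This is only an illustration under stated assumptions: `static_db_bytes` is a hypothetical helper, the ~9G per TB ratio is the ZS figure quoted in this thread (no equivalent number has been established for rocksdb here), units are assumed to be binary (GiB/TiB), and the result is rounded up to a whole GiB.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kGiB = 1ull << 30;
constexpr std::uint64_t kTiB = 1ull << 40;

// Hypothetical helper: bytes of metadata space to reserve for `data_bytes`
// of data, given a ratio expressed as "GiB of db per TiB of data".
// The result is rounded up to a whole number of GiB.
inline std::uint64_t static_db_bytes(std::uint64_t data_bytes,
                                     std::uint64_t gib_per_tib) {
  // One GiB per TiB is a 1/1024 ratio, so scale and divide by 1024.
  const std::uint64_t raw = data_bytes * gib_per_tib / (kTiB / kGiB);
  const std::uint64_t whole_gib = (raw + kGiB - 1) / kGiB;  // round up
  assert(whole_gib * kGiB >= raw);
  return whole_gib * kGiB;
}
```

With the 9G per TB ratio, a 1 TiB data partition gives 9 GiB, and the 140G block device used in Mark's runs would give about 1.23 GiB, rounded up to 2 GiB. That says nothing about whether the same ratio holds for rocksdb, or for omap-heavy workloads.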
> 
> sage
> 
> 
>  >
> > Thanks & Regards
> > Somnath
> >
> >
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@redhat.com]
> > Sent: Tuesday, July 12, 2016 6:03 AM
> > To: Igor Fedotov; Somnath Roy; ceph-devel
> > Subject: Re: bluestore onode diet and encoding overhead
> >
> > In this case I'm assigning per OSD:
> >
> > 1G Data (basically the top level OSD dir) 1G WAL 8G DB 140G Block
> >
> > Mark
> >
> > On 07/12/2016 07:57 AM, Igor Fedotov wrote:
> > > Mark,
> > >
> > > you can find my post named 'yet another assertion in bluestore
> > > during random write' last week. It contains steps to reproduce in my case.
> > >
> > > Also I did some investigations (still incomplete though) with tuning
> > > 'bluestore block db size' and 'bluestore block wal size'. Setting
> > > both to 256M fixes the issue for me.
> > >
> > > But I'm still uncertain if that's a bug or just inappropriate settings...
> > >
> > >
> > > Thanks,
> > >
> > > Igor
> > >
> > >
> > > On 12.07.2016 15:48, Mark Nelson wrote:
> > >> Oh, that's good to know!  Have you tracked it down at all?  I
> > >> noticed pretty extreme memory usage on the OSDs still, so that
> > >> might be part of it.  I'm doing a massif run now.
> > >>
> > >> Mark
> > >>
> > >> On 07/12/2016 07:40 AM, Igor Fedotov wrote:
> > >>> That's similar to what I have while running my test case with vstart...
> > >>> Without Somnath's settings though..
> > >>>
> > >>>
> > >>> On 12.07.2016 15:34, Mark Nelson wrote:
> > >>>> Hi Somnath,
> > >>>>
> > >>>> I accidentally screwed up my first run with your settings but
> > >>>> reran last night.  With your tuning the OSDs are failing to
> > >>>> allocate to
> > >>>> bdev0 after about 30 minutes of testing:
> > >>>>
> > >>>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate
> > >>>> failed to allocate 1048576 on bdev 0, free 0; fallback to bdev 1
> > >>>>
> > >>>> They are able to continue running, but ultimately this leads to
> > >>>> an assert later on.  I wonder if it's not compacting fast enough
> > >>>> and ends up consuming the entire disk with stale metadata.
> > >>>>
> > >>>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
> > >>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In
> > >>>> function 'int BlueFS::_allocate(unsigned int, uint64_t,
> > >>>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time
> > >>>> 2016-07-12
> > >>>> 04:31:02.627138
> > >>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398:
> > >>>> FAILED
> > >>>> assert(0 == "allocate failed... wtf")
> > >>>>
> > >>>>  ceph version v10.0.4-6936-gc7da2f7
> > >>>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
> > >>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > >>>> const*)+0x85) [0xd4cb75]
> > >>>>  2: (BlueFS::_allocate(unsigned int, unsigned long,
> > >>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t>
> > >>>> >*)+0x760) [0xb98220]
> > >>>>  3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
> > >>>>  4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
> > >>>>  5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
> > >>>>  6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
> > >>>>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
> > >>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
> > >>>> unsigned long, bool)+0x1456) [0xbfdb96]
> > >>>>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
> > >>>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
> > >>>>  9:
> > >>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::
> > >>>> TransactionImpl>)+0x6b) [0xb3df2b]
> > >>>>  10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
> > >>>>  11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
> > >>>>  12: (()+0x7dc5) [0x7f0d185c4dc5]
> > >>>>  13: (clone()+0x6d) [0x7f0d164bf28d]
> > >>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>`
> > >>>> is needed to interpret this.
> > >>>>
> > >>>>
> > >>>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
> > >>>>> Thanks Mark !
> > >>>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did
> > >>>>> you get chance to try out the rocksdb tuning I posted earlier ?
> > >>>>> It may reduce the stalls in your environment.
> > >>>>>
> > >>>>> Regards
> > >>>>> Somnath
> > >>>>>
> > >>>>> -----Original Message-----
> > >>>>> From: ceph-devel-owner@vger.kernel.org
> > >>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
> > >>>>> Nelson
> > >>>>> Sent: Tuesday, July 12, 2016 12:03 AM
> > >>>>> To: ceph-devel
> > >>>>> Subject: bluestore onode diet and encoding overhead
> > >>>>>
> > >>>>> Hi All,
> > >>>>>
> > >>>>> With Igor's patch last week I was able to get some bluestore
> > >>>>> performance runs in without segfaulting and started looking into
> > >>>>> the results.
> > >>>>> Somewhere along the line we really screwed up read performance,
> > >>>>> but that's another topic.  Right now I want to focus on random
> > >>>>> writes.
> > >>>>> Before we put the onode on a diet we were seeing massive amounts
> > >>>>> of read traffic in RocksDB during compaction that caused write
> > >>>>> stalls during 4K random writes.  Random write performance on
> > >>>>> fast hardware like NVMe devices was often below filestore at
> > >>>>> anything other than very large IO sizes.  This was largely due
> > >>>>> to the size of the onode compounded with RocksDB's tendency
> > >>>>> toward read and write amplification.
> > >>>>>
> > >>>>> The new test results look very promising.  We've dramatically
> > >>>>> improved performance of random writes at most IO sizes, so that
> > >>>>> they are now typically quite a bit higher than both filestore
> > >>>>> and older bluestore code.  Unfortunately for very small IO sizes
> > >>>>> performance hasn't improved much.  We are no longer seeing huge
> > >>>>> amounts of RocksDB read traffic and fewer write stalls.  We are
> > >>>>> however seeing huge memory usage (~9GB RSS per OSD) and very
> > >>>>> high CPU usage.  I think this confirms some of the memory issues
> > >>>>> somnath was continuing to see.  I don't think it's a leak
> > >>>>> exactly based on how the OSDs were behaving, but we need to run
> > >>>>> through massif still to be sure.
> > >>>>>
> > >>>>> I ended up spending some time tonight with perf and digging
> > >>>>> through the encode code.  I wrote up some notes with graphs and
> > >>>>> code snippets and decided to put them up on the web.  Basically
> > >>>>> some of the encoding changes we implemented last month to reduce
> > >>>>> the onode size also appear to result in more buffer::list
> > >>>>> appends and the associated overhead.
> > >>>>> I've been trying to think through ways to improve the situation
> > >>>>> and thought other people might have some ideas too.  Here's a
> > >>>>> link to the short writeup:
> > >>>>>
> > >>>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Mark
> > >>>>> --
> > >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> > >>>>> in the body of a message to majordomo@vger.kernel.org More
> > >>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>
> > >

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-12 15:46                   ` Mark Nelson
@ 2016-07-12 20:48                     ` Mark Nelson
  0 siblings, 0 replies; 39+ messages in thread
From: Mark Nelson @ 2016-07-12 20:48 UTC (permalink / raw)
  To: Somnath Roy, Igor Fedotov, ceph-devel

Ok, good news and bad news.  I can confirm that stupid allocator is
letting me get past where I hit the allocation issues with Somnath's
RocksDB tuning last time.  Memory usage is still crazy and continues to
grow most during random reads (it does release periodically, but never
enough to return to previous levels).

Random write performance is generally lower than the default tuning with
bitmap allocator.  Seq write performance is much better at small/medium
IO sizes though (it had regressed since jewel).  Not sure if that's
Somnath's tuning or stupid allocator vs bitmap.  I'll have to try a
stupid allocator run without Somnath's tuning to check.  Read/Randread
performance is still pretty terrible (actually even worse).

Mark

On 07/12/2016 10:46 AM, Mark Nelson wrote:
> I'm seeing the majority of memory growth happening during random reads
> still.  After looking through the massif output, it looks like it may be
> associated with the bufferptr creation in KernelDevice::read here:
>
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L477
>
>
> On 07/12/2016 10:36 AM, Somnath Roy wrote:
>> << And another observation - the issue isn't reproduced with stupid
>> allocator hence I suspect some bug in bitmap one
>> I was about to correlate that; it seems to be a bug in the Bitmap
>> allocator then. I need to check whether the memory growth is also
>> related to the Bitmap allocator or not. I will do some digging.
>>
>> Thanks & Regards
>> Somnath
>> -----Original Message-----
>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>> Sent: Tuesday, July 12, 2016 8:32 AM
>> To: Somnath Roy; Mark Nelson; ceph-devel
>> Subject: Re: bluestore onode diet and encoding overhead
>>
>> Somnath,
>>
>> yeah,  you're right about db partition is getting out of space:
>>
>> -876> 2016-07-12 18:19:26.133795 7f8e6ddb7700 10 bluefs get_usage bdev 0
>> free 0 (0 B) / 268431360 (255 MB), used 100%
>> -875> 2016-07-12 18:19:26.133796 7f8e6ddb7700 10 bluefs get_usage bdev 1
>> free 193986560 (185 MB) / 268427264 (255 MB), used 27%
>> -874> 2016-07-12 18:19:26.133797 7f8e6ddb7700 10 bluefs get_usage bdev 2
>> free 1073741824 (1024 MB) / 1074782208 (1024 MB), used 0%
>>
>> And I don't see much RAM consumption in this case.
>>
>> But the curious thing about my test case is that it shouldn't increase
>> amount of metadata written as I'm doing writes within the first megabyte
>> only( see fio script I posted last week).
>>
>> Looks like somebody wastes DB space - usage at bdev 0 is constantly
>> growing while I'm running the test case...
>> And another observation - the issue isn't reproduced with stupid
>> allocator hence I suspect some bug in bitmap one...
>>
>> Thanks,
>> Igor
>>
>>
>> On 12.07.2016 18:14, Somnath Roy wrote:
>>> Mark,
>>> Recently, the default allocator is changed to Bitmap and I saw it is
>>> returning < 0 return value only in the following case.
>>>
>>>    count = m_bit_alloc->alloc_blocks_res(nblks, &start_blk);
>>>    if (count == 0) {
>>>      return -ENOSPC;
>>>    }
>>>
>>> So, it seems it may not be the memory but db partition is getting out
>>> of space (?). I never faced it so far as I was running with 100GB of
>>> db partition may be.
>>> The amount of metadata write going on to the db even after onode diet
>>> is starting from ~1K and over time it is reaching > 4k or so (I
>>> checked for 4K RW). It is growing as extents are growing. So, 8 GB
>>> may not be enough.
>>> If this is true, the next challenge is how to automatically size
>>> (or document) the rocksdb db partition based on the data
>>> partition size. For example, in the ZS case, we have calculated that
>>> we need ~9G db space per TB. We need to do a similar calculation for
>>> rocksdb as well.
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>> Sent: Tuesday, July 12, 2016 6:03 AM
>>> To: Igor Fedotov; Somnath Roy; ceph-devel
>>> Subject: Re: bluestore onode diet and encoding overhead
>>>
>>> In this case I'm assigning per OSD:
>>>
>>> 1G Data (basically the top level OSD dir) 1G WAL 8G DB 140G Block
>>>
>>> Mark
>>>
>>> On 07/12/2016 07:57 AM, Igor Fedotov wrote:
>>>> Mark,
>>>>
>>>> you can find my post named 'yet another assertion in bluestore during
>>>> random write' last week. It contains steps to reproduce in my case.
>>>>
>>>> Also I did some investigations (still incomplete though) with tuning
>>>> 'bluestore block db size' and 'bluestore block wal size'. Setting both
>>>> to 256M fixes the issue for me.
>>>>
>>>> But I'm still uncertain if that's a bug or just inappropriate
>>>> settings...
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 12.07.2016 15:48, Mark Nelson wrote:
>>>>> Oh, that's good to know!  Have you tracked it down at all?  I noticed
>>>>> pretty extreme memory usage on the OSDs still, so that might be part
>>>>> of it.  I'm doing a massif run now.
>>>>>
>>>>> Mark
>>>>>
>>>>> On 07/12/2016 07:40 AM, Igor Fedotov wrote:
>>>>>> That's similar to what I have while running my test case with
>>>>>> vstart...
>>>>>> Without Somnath's settings though..
>>>>>>
>>>>>>
>>>>>> On 12.07.2016 15:34, Mark Nelson wrote:
>>>>>>> Hi Somnath,
>>>>>>>
>>>>>>> I accidentally screwed up my first run with your settings but reran
>>>>>>> last night.  With your tuning the OSDs are failing to allocate to
>>>>>>> bdev0 after about 30 minutes of testing:
>>>>>>>
>>>>>>> 2016-07-12 03:48:51.127781 7f0cef8b7700 -1 bluefs _allocate failed
>>>>>>> to allocate 1048576 on bdev 0, free 0; fallback to bdev 1
>>>>>>>
>>>>>>> They are able to continue running, but ultimately this leads to an
>>>>>>> assert later on.  I wonder if it's not compacting fast enough and
>>>>>>> ends up consuming the entire disk with stale metadata.
>>>>>>>
>>>>>>> 2016-07-12 04:31:02.631982 7f0cef8b7700 -1
>>>>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: In
>>>>>>> function 'int BlueFS::_allocate(unsigned int, uint64_t,
>>>>>>> std::vector<bluefs_extent_t>*)' thread 7f0cef8b7700 time 2016-07-12
>>>>>>> 04:31:02.627138
>>>>>>> /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueFS.cc: 1398:
>>>>>>> FAILED
>>>>>>> assert(0 == "allocate failed... wtf")
>>>>>>>
>>>>>>>   ceph version v10.0.4-6936-gc7da2f7
>>>>>>> (c7da2f7c869694246650a9276a2b67aed9bf818f)
>>>>>>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>>>> const*)+0x85) [0xd4cb75]
>>>>>>>   2: (BlueFS::_allocate(unsigned int, unsigned long,
>>>>>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t>
>>>>>>> >*)+0x760) [0xb98220]
>>>>>>>   3: (BlueFS::_compact_log()+0xd5b) [0xb9b5ab]
>>>>>>>   4: (BlueFS::_maybe_compact_log()+0x2a0) [0xb9c040]
>>>>>>>   5: (BlueFS::sync_metadata()+0x20f) [0xb9d28f]
>>>>>>>   6: (BlueRocksDirectory::Fsync()+0xd) [0xbb2fad]
>>>>>>>   7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>>>>>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
>>>>>>> unsigned long, bool)+0x1456) [0xbfdb96]
>>>>>>>   8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>>>>>>> rocksdb::WriteBatch*)+0x27) [0xbfe7a7]
>>>>>>>   9:
>>>>>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::
>>>>>>> TransactionImpl>)+0x6b) [0xb3df2b]
>>>>>>>   10: (BlueStore::_kv_sync_thread()+0xedb) [0xaf935b]
>>>>>>>   11: (BlueStore::KVSyncThread::entry()+0xd) [0xb21e8d]
>>>>>>>   12: (()+0x7dc5) [0x7f0d185c4dc5]
>>>>>>>   13: (clone()+0x6d) [0x7f0d164bf28d]
>>>>>>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>>>> needed to interpret this.
>>>>>>>
>>>>>>>
>>>>>>> On 07/12/2016 02:13 AM, Somnath Roy wrote:
>>>>>>>> Thanks Mark !
>>>>>>>> Yes, quite similar result I am also seeing for 4K RW. BTW, did you
>>>>>>>> get chance to try out the rocksdb tuning I posted earlier ? It may
>>>>>>>> reduce the stalls in your environment.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Somnath
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>>>> Sent: Tuesday, July 12, 2016 12:03 AM
>>>>>>>> To: ceph-devel
>>>>>>>> Subject: bluestore onode diet and encoding overhead
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> With Igor's patch last week I was able to get some bluestore
>>>>>>>> performance runs in without segfaulting and started looking into
>>>>>>>> the results.
>>>>>>>> Somewhere along the line we really screwed up read performance,
>>>>>>>> but that's another topic.  Right now I want to focus on random
>>>>>>>> writes.
>>>>>>>> Before we put the onode on a diet we were seeing massive amounts
>>>>>>>> of read traffic in RocksDB during compaction that caused write
>>>>>>>> stalls during 4K random writes.  Random write performance on fast
>>>>>>>> hardware like NVMe devices was often below filestore at anything
>>>>>>>> other than very large IO sizes.  This was largely due to the size
>>>>>>>> of the onode compounded with RocksDB's tendency toward read and
>>>>>>>> write amplification.
>>>>>>>>
>>>>>>>> The new test results look very promising.  We've dramatically
>>>>>>>> improved performance of random writes at most IO sizes, so that
>>>>>>>> they are now typically quite a bit higher than both filestore and
>>>>>>>> older bluestore code.  Unfortunately for very small IO sizes
>>>>>>>> performance hasn't improved much.  We are no longer seeing huge
>>>>>>>> amounts of RocksDB read traffic and fewer write stalls.  We are
>>>>>>>> however seeing huge memory usage (~9GB RSS per OSD) and very high
>>>>>>>> CPU usage.  I think this confirms some of the memory issues
>>>>>>>> somnath was continuing to see.  I don't think it's a leak exactly
>>>>>>>> based on how the OSDs were behaving, but we need to run through
>>>>>>>> massif still to be sure.
>>>>>>>>
>>>>>>>> I ended up spending some time tonight with perf and digging
>>>>>>>> through the encode code.  I wrote up some notes with graphs and
>>>>>>>> code snippets and decided to put them up on the web.  Basically
>>>>>>>> some of the encoding changes we implemented last month to reduce
>>>>>>>> the onode size also appear to result in more buffer::list appends
>>>>>>>> and the associated overhead.
>>>>>>>> I've been trying to think through ways to improve the situation
>>>>>>>> and thought other people might have some ideas too.  Here's a link
>>>>>>>> to the short writeup:
>>>>>>>>
>>>>>>>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Mark
>>>>>>>>
>>>>>>>>
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-12 15:37   ` Mark Nelson
@ 2016-07-12 21:15     ` Allen Samuels
  2016-07-12 22:04       ` Mark Nelson
  0 siblings, 1 reply; 39+ messages in thread
From: Allen Samuels @ 2016-07-12 21:15 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel

Great papers!

Your profiling pretty much shows that the problem is really the buffer::list stuff and not the encoding itself (at least not yet!)

Yes, it's relatively easy to fix the buffer encoding. You just have to over-allocate (do a worst-case computation for the data), then do the encoding into the over-allocated chunk, and then free up the unused portion.
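
The over-allocate approach described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not Ceph's encode path: it uses a plain `std::vector` instead of `buffer::list`, a generic LEB128-style varint instead of the actual onode encoders, and `put_varint` and `encode_all` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Worst case for one base-128 varint of a 64-bit value: ceil(64 / 7) bytes.
constexpr std::size_t kMaxVarint64 = 10;

// Write v at p as a little-endian base-128 varint (7 payload bits per byte,
// high bit set on every byte except the last). Returns one past the last
// byte written. The caller guarantees at least kMaxVarint64 bytes of room.
inline std::uint8_t* put_varint(std::uint8_t* p, std::uint64_t v) {
  while (v >= 0x80) {
    *p++ = static_cast<std::uint8_t>((v & 0x7f) | 0x80);
    v >>= 7;
  }
  *p++ = static_cast<std::uint8_t>(v);
  return p;
}

// Hypothetical helper: encode every value into one contiguous chunk.
std::vector<std::uint8_t> encode_all(const std::vector<std::uint64_t>& values) {
  // 1. Over-allocate using the worst-case size, in a single allocation.
  std::vector<std::uint8_t> out(values.size() * kMaxVarint64);
  // 2. Encode through a raw pointer: no per-field append or bounds check.
  std::uint8_t* p = out.data();
  for (std::uint64_t v : values) {
    p = put_varint(p, v);
  }
  // 3. Give back the unused tail.
  const std::size_t used = static_cast<std::size_t>(p - out.data());
  assert(used <= out.size());
  out.resize(used);
  return out;
}
```

Note that `resize` only trims the logical length; whether the spare capacity is actually returned to the allocator (for example via `shrink_to_fit`, which is a non-binding request) is a separate decision.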


Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, July 12, 2016 8:38 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> 
> 
> On 07/12/2016 10:20 AM, Allen Samuels wrote:
> > Good analysis.
> >
> > My original comments about putting the oNode on a diet included the idea
> > of a "custom" encode/decode path for certain high-usage cases. At the time,
> > Sage resisted going down that path hoping that a more optimized generic
> > case would get the job done. Your analysis shows that while we've achieved
> > significant space reduction this has come at the expense of CPU time -- which
> > dominates small object performance (I suspect that eventually we'd discover
> > that the variable length decode path would be responsible for a substantial
> > read performance degradation also -- which may or may not be part of the
> > read performance drop-off that you're seeing). This isn't a surprising result,
> > though it is unfortunate.
> >
> > I believe we need to revisit the idea of custom encode/decode paths for
> > high-usage cases, only now the gains need to be focused on CPU utilization
> > as well as space efficiency.
> 
> I'm not against it, but it might be worth at least a quick attempt at
> preallocating the append_buffer and/or Piotr's idea to directly memcpy
> without doing the append at all.  It may be that helps quite a bit (though
> perhaps it's not enough in the long run).
> 
> A couple of other thoughts:
> 
> I still think SIMD encode approaches are interesting if we can lay data out in
> memory in a friendly way (This feels like it might be painful
> though):
> 
> http://arxiv.org/abs/1209.2137
> 
> But on the other hand, Kenton Varda who was previously a primary author
> on google's protocol buffers ended up doing something a little different than
> varint:
> 
> https://capnproto.org/encoding.html
> 
> Look specifically at the packing section.  It looks somewhat attractive to me.
> 
> Mark
> 
> >
> > I believe this activity can also address some of the memory consumption
> > issues that we're seeing now. I believe that the current lextent/blob/pextent
> > usage of standard STL maps is both space and time inefficient -- in a place
> > where it matters a lot. Sage has already discussed usage of something like
> > flat_map from the boost library as a way to reduce the memory overhead,
> > etc. I believe this is the right direction.
> >
> > Where are we on getting boost into our build?
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >> owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> Sent: Tuesday, July 12, 2016 12:03 AM
> >> To: ceph-devel <ceph-devel@vger.kernel.org>
> >> Subject: bluestore onode diet and encoding overhead
> >>
> >> Hi All,
> >>
> >> With Igor's patch last week I was able to get some bluestore performance
> >> runs in without segfaulting and started looking into the results.
> >> Somewhere along the line we really screwed up read performance, but
> >> that's another topic.  Right now I want to focus on random writes.
> >> Before we put the onode on a diet we were seeing massive amounts of read
> >> traffic in RocksDB during compaction that caused write stalls during 4K
> >> random writes.  Random write performance on fast hardware like NVMe
> >> devices was often below filestore at anything other than very large IO
> >> sizes.
> >> This was largely due to the size of the onode compounded with RocksDB's
> >> tendency toward read and write amplification.
> >>
> >> The new test results look very promising.  We've dramatically improved
> >> performance of random writes at most IO sizes, so that they are now
> >> typically quite a bit higher than both filestore and older bluestore code.
> >> Unfortunately for very small IO sizes performance hasn't improved much.
> >> We are no longer seeing huge amounts of RocksDB read traffic and fewer
> >> write stalls.  We are however seeing huge memory usage (~9GB RSS per
> >> OSD) and very high CPU usage.  I think this confirms some of the memory
> >> issues somnath was continuing to see.  I don't think it's a leak exactly
> >> based on how the OSDs were behaving, but we need to run through massif
> >> still to be sure.
> >>
> >> I ended up spending some time tonight with perf and digging through the
> >> encode code.  I wrote up some notes with graphs and code snippets and
> >> decided to put them up on the web.  Basically some of the encoding
> >> changes we implemented last month to reduce the onode size also appear
> >> to result in more buffer::list appends and the associated overhead.
> >> I've been trying to think through ways to improve the situation and
> >> thought other people might have some ideas too.  Here's a link to the
> >> short writeup:
> >>
> >> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
> >>
> >> Thanks,
> >> Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-12 21:15     ` Allen Samuels
@ 2016-07-12 22:04       ` Mark Nelson
  0 siblings, 0 replies; 39+ messages in thread
From: Mark Nelson @ 2016-07-12 22:04 UTC (permalink / raw)
  To: Allen Samuels, ceph-devel

On 07/12/2016 04:15 PM, Allen Samuels wrote:
> Great papers!

Both are backed by open source code on github, which was some of my
motivation for looking at them.  The SIMD encoding paper only deals with
32-bit ints afaik, but Cap'n Proto looks pretty robust/convenient out
of the box.
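
To make the packing idea concrete, here is a simplified sketch of the tag-byte scheme from the Cap'n Proto encoding page: each 8-byte word becomes one tag byte (bit i set when byte i is non-zero) followed by only the non-zero bytes. This is an illustration under assumptions, not the real wire format: it leaves out the run-length special cases Cap'n Proto defines for tags 0x00 and 0xff, so the output is not compatible with the actual library, and `pack_words` and `unpack_words` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack a buffer of 8-byte words. For each word, emit a tag byte whose bit i
// says whether byte i of the word is non-zero, then emit the non-zero bytes.
// Simplified: no run-length handling for all-zero or all-non-zero words.
std::vector<std::uint8_t> pack_words(const std::vector<std::uint8_t>& in) {
  assert(in.size() % 8 == 0);
  std::vector<std::uint8_t> out;
  out.reserve(in.size() + in.size() / 8);  // worst case: tag + 8 bytes per word
  for (std::size_t w = 0; w < in.size(); w += 8) {
    const std::size_t tag_pos = out.size();
    out.push_back(0);  // placeholder, filled in once the word is scanned
    std::uint8_t tag = 0;
    for (unsigned i = 0; i < 8; ++i) {
      if (in[w + i] != 0) {
        tag = static_cast<std::uint8_t>(tag | (1u << i));
        out.push_back(in[w + i]);
      }
    }
    out[tag_pos] = tag;
  }
  return out;
}

// Inverse of pack_words: expand each tag byte back into a full 8-byte word.
std::vector<std::uint8_t> unpack_words(const std::vector<std::uint8_t>& in) {
  std::vector<std::uint8_t> out;
  std::size_t pos = 0;
  while (pos < in.size()) {
    const std::uint8_t tag = in[pos++];
    for (unsigned i = 0; i < 8; ++i) {
      if (tag & (1u << i)) {
        assert(pos < in.size());  // truncated input would trip this
        out.push_back(in[pos++]);
      } else {
        out.push_back(0);
      }
    }
  }
  return out;
}
```

Unlike a varint, the decoder never has to test a continuation bit per byte; it reads one tag and then knows exactly how many bytes follow for that word, which is part of what makes the scheme look attractive for mostly-zero fixed-width fields.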

>
> Your profiling pretty much shows that the problem is really the buffer::list stuff and not the encoding itself (at least not yet!)
>
> Yes, it's relatively easy to fix the buffer encoding. You just have to over-allocate (do a worst-case computation for the data), and then do the encoding into the over-allocated chunk and then free up the unused portion.
>
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Tuesday, July 12, 2016 8:38 AM
>> To: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
>> devel@vger.kernel.org>
>> Subject: Re: bluestore onode diet and encoding overhead
>>
>>
>>
>> On 07/12/2016 10:20 AM, Allen Samuels wrote:
>>> Good analysis.
>>>
>>> My original comments about putting the oNode on a diet included the idea
>> of a "custom" encode/decode path for certain high-usage cases. At the time,
>> Sage resisted going down that path hoping that a more optimized generic
>> case would get the job done. Your analysis shows that while we've achieved
>> significant space reduction this has come at the expense of CPU time -- which
>> dominates small object performance (I suspect that eventually we'd discover
>> that the variable length decode path would be responsible for a substantial
>> read performance degradation also -- which may or may not be part of the
>> read performance drop-off that you're seeing). This isn't a surprising result,
>> though it is unfortunate.
>>>
>>> I believe we need to revisit the idea of custom encode/decode paths for
>> high-usage cases, only now the gains need to be focused on CPU utilization
>> as well as space efficiency.
>>
>> I'm not against it, but it might be worth at least a quick attempt at
>> preallocating the append_buffer and/or Piotr's idea to directly memcpy
>> without doing the append at all.  It may be that helps quite a bit (though
>> perhaps it's not enough in the long run).
>>
>> A couple of other thoughts:
>>
>> I still think SIMD encode approaches are interesting if we can lay data out in
>> memory in a friendly way (This feels like it might be painful
>> though):
>>
>> http://arxiv.org/abs/1209.2137
>>
>> But on the other hand, Kenton Varda who was previously a primary author
>> on google's protocol buffers ended up doing something a little different than
>> varint:
>>
>> https://capnproto.org/encoding.html
>>
>> Look specifically at the packing section.  It looks somewhat attractive to me.
>>
>> Mark
>>
>>>
>>> I believe this activity can also address some of the memory consumption
>> issues that we're seeing now. I believe that the current lextent/blob/pextent
>> usage of standard STL maps is both space and time inefficient -- in a place
>> where it matters a lot. Sage has already discussed usage of something like
>> flat_map from the boost library as a way to reduce the memory overhead,
>> etc. I believe this is the right direction.
>>>
>>> Where are we on getting boost into our build?
>>>
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, Milpitas, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>
>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>> owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>> Sent: Tuesday, July 12, 2016 12:03 AM
>>>> To: ceph-devel <ceph-devel@vger.kernel.org>
>>>> Subject: bluestore onode diet and encoding overhead
>>>>
>>>> Hi All,
>>>>
>>>> With Igor's patch last week I was able to get some bluestore performance
>>>> runs in without segfaulting and started looking into the results.
>>>> Somewhere along the line we really screwed up read performance, but
>>>> that's another topic.  Right now I want to focus on random writes.
>>>> Before we put the onode on a diet we were seeing massive amounts of
>> read
>>>> traffic in RocksDB during compaction that caused write stalls during 4K
>>>> random writes.  Random write performance on fast hardware like NVMe
>>>> devices was often below filestore at anything other than very large IO
>> sizes.
>>>> This was largely due to the size of the onode compounded with RocksDB's
>>>> tendency toward read and write amplification.
>>>>
>>>> The new test results look very promising.  We've dramatically improved
>>>> performance of random writes at most IO sizes, so that they are now
>>>> typically quite a bit higher than both filestore and older bluestore code.
>>>> Unfortunately for very small IO sizes performance hasn't improved much.
>>>> We are no longer seeing huge amounts of RocksDB read traffic and fewer
>>>> write stalls.  We are however seeing huge memory usage (~9GB RSS per
>>>> OSD) and very high CPU usage.  I think this confirms some of the memory
>>>> issues somnath was continuing to see.  I don't think it's a leak exactly
>> based
>>>> on how the OSDs were behaving, but we need to run through massif still
>> to
>>>> be sure.
>>>>
>>>> I ended up spending some time tonight with perf and digging through the
>>>> encode code.  I wrote up some notes with graphs and code snippets and
>>>> decided to put them up on the web.  Basically some of the encoding
>> changes
>>>> we implemented last month to reduce the onode size also appear to
>> result in
>>>> more buffer::list appends and the associated overhead.
>>>> I've been trying to think through ways to improve the situation and
>> thought
>>>> other people might have some ideas too.  Here's a link to the short
>> writeup:
>>>>
>>>>
>> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?us
>>>> p=sharing
>>>>
>>>> Thanks,
>>>> Mark
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-12 15:20 ` Allen Samuels
  2016-07-12 15:37   ` Mark Nelson
@ 2016-07-13  1:50   ` Sage Weil
  2016-07-13  3:13     ` Mark Nelson
  2016-07-13 14:47     ` Samuel Just
  1 sibling, 2 replies; 39+ messages in thread
From: Sage Weil @ 2016-07-13  1:50 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Mark Nelson, ceph-devel

On Tue, 12 Jul 2016, Allen Samuels wrote:
> Good analysis.
> 
> My original comments about putting the oNode on a diet included the idea 
> of a "custom" encode/decode path for certain high-usage cases. At the 
> time, Sage resisted going down that path hoping that a more optimized 
> generic case would get the job done. Your analysis shows that while 
> we've achieved significant space reduction this has come at the expense 
> of CPU time -- which dominates small object performance (I suspect that 
> eventually we'd discover that the variable length decode path would be 
> responsible for a substantial read performance degradation also -- which 
> may or may not be part of the read performance drop-off that you're 
> seeing). This isn't a surprising result, though it is unfortunate.
> 
> I believe we need to revisit the idea of custom encode/decode paths for 
> high-usage cases, only now the gains need to be focused on CPU 
> utilization as well as space efficiency.

I still think we can get most or all of the way there in a generic way by 
revising the way that we interact with bufferlist for encode and decode.  
We haven't actually tried to optimize this yet, and the current code is 
pretty horribly inefficient (asserts all over the place, and many layers 
of pointer indirection to do a simple append).  I think we need to do two 
things:

1) decode path: optimize the iterator class so that it has a const char 
*current and const char *current_end that point into the current 
buffer::ptr.  This way any decode will have a single pointer 
add+comparison to ensure there is enough data to copy before falling into 
the slow path (partial buffer, move to next buffer, etc.).

2) Having that comparison is still not ideal, but we should consider ways 
to get around that too.  For example, if we know that we are going to 
decode N M-byte things, we could do an iterator 'reserve' or 'check' that 
ensures we have a valid pointer for that much and then proceed without 
checks.  The interface here would be tricky, though, since in the slow 
case we'll span buffers and need to magically fall back to a different 
decode path (hard to maintain) or do a temporary copy (probably faster but 
we need to ensure the iterator owns it and frees it later).  I'd say this 
is step 2 and optional; step 1 will have the most benefit.

3) encode path: currently all encode methods take a bufferlist& and use 
the bufferlist itself as an append buffer.  I think this is flawed and 
limiting.  Instead, we should make a new class called 
buffer::list::appender (or similar) and templatize the encode methods so 
they can take a safe_appender (which does bounds checking) or an 
unsafe_appender (which does not).  For the latter, the user takes 
responsibility for making sure there is enough space by doing a reserve() 
type call which returns an unsafe_appender, and it's their job to make 
sure they don't shove too much data into it.  That should make the encode 
path a memcpy + ptr increment (for savvy/optimized callers).
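
A minimal sketch of the templated-appender idea in (3), under stated
assumptions: safe_appender, unsafe_appender, encode_u32, extent and
encode_extents_fast are hypothetical names for illustration, a std::vector
stands in for the bufferlist, and values are written in host byte order.
None of this is the existing Ceph API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical: appender that grows its backing store on every write.
class safe_appender {
  std::vector<char>& out_;
public:
  explicit safe_appender(std::vector<char>& out) : out_(out) {}
  void append(const void* p, std::size_t n) {
    const char* c = static_cast<const char*>(p);
    out_.insert(out_.end(), c, c + n);  // checks/grows as needed
  }
};

// Hypothetical: appender over space the caller already reserved.
// No checks here: append is a memcpy plus a pointer increment.
class unsafe_appender {
  char* pos_;
public:
  explicit unsafe_appender(char* start) : pos_(start) {}
  void append(const void* p, std::size_t n) {
    std::memcpy(pos_, p, n);
    pos_ += n;
  }
  char* pos() const { return pos_; }
};

// Encode methods are templated on the appender type.
template <typename Appender>
void encode_u32(std::uint32_t v, Appender& a) { a.append(&v, sizeof(v)); }

struct extent { std::uint32_t offset; std::uint32_t length; };

template <typename Appender>
void encode(const extent& e, Appender& a) {
  encode_u32(e.offset, a);
  encode_u32(e.length, a);
}

// Savvy caller: reserve the exact size once, then encode unchecked.
inline std::vector<char> encode_extents_fast(const std::vector<extent>& v) {
  std::vector<char> buf(v.size() * 2 * sizeof(std::uint32_t));
  unsafe_appender a(buf.data());
  for (const auto& e : v) encode(e, a);
  assert(a.pos() <= buf.data() + buf.size());  // caller's responsibility
  buf.resize(static_cast<std::size_t>(a.pos() - buf.data()));
  return buf;
}
```

The same encode() template serves both callers, so only the code that can
bound its output size up front needs to opt in to the unchecked path.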

I suggest we use bluestore as a test case to make the interfaces work and 
be fast.  If we succeed we can take advantage of it across the rest of 
the code base as well.

That's my thinking, at least.  I haven't had time to prototype it out yet, 
but I think our goal should be to make the encode/decode paths capable of 
being a memcpy + ptr addition in the fast path, and let that guide the 
interface...
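
For the decode side, a rough sketch of what "memcpy + ptr addition in the
fast path" could look like. seg_iterator is a hypothetical stand-in for the
bufferlist iterator, and a vector of std::string stands in for the list of
buffer::ptr segments; values are read in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

class seg_iterator {
  const std::vector<std::string>& segs_;
  std::size_t idx_ = 0;
  const char* current_ = nullptr;      // next unread byte in this segment
  const char* current_end_ = nullptr;  // one past the end of this segment

  void load(std::size_t i) {
    assert(i <= segs_.size());
    idx_ = i;
    if (i < segs_.size()) {
      current_ = segs_[i].data();
      current_end_ = current_ + segs_[i].size();
    } else {
      current_ = current_end_ = nullptr;
    }
  }

  // Slow path: the request spans segments (or runs off the end).
  void copy_slow(char* dst, std::size_t n) {
    while (n > 0) {
      if (idx_ >= segs_.size()) throw std::out_of_range("end of buffer");
      std::size_t avail = static_cast<std::size_t>(current_end_ - current_);
      if (avail == 0) { load(idx_ + 1); continue; }
      std::size_t take = avail < n ? avail : n;
      std::memcpy(dst, current_, take);
      current_ += take; dst += take; n -= take;
    }
  }

public:
  explicit seg_iterator(const std::vector<std::string>& segs) : segs_(segs) {
    load(0);
  }

  void copy(char* dst, std::size_t n) {
    if (n == 0) return;
    // Fast path: one comparison, one memcpy, one pointer addition.
    if (static_cast<std::size_t>(current_end_ - current_) >= n) {
      std::memcpy(dst, current_, n);
      current_ += n;
    } else {
      copy_slow(dst, n);
    }
  }

  std::uint32_t get_u32() {
    std::uint32_t v;
    copy(reinterpret_cast<char*>(&v), sizeof(v));
    return v;
  }
};
```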

sage

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-13  1:50   ` Sage Weil
@ 2016-07-13  3:13     ` Mark Nelson
  2016-07-13  6:33       ` Piotr Dałek
  2016-07-14  5:52       ` Allen Samuels
  2016-07-13 14:47     ` Samuel Just
  1 sibling, 2 replies; 39+ messages in thread
From: Mark Nelson @ 2016-07-13  3:13 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel



On 07/12/2016 08:50 PM, Sage Weil wrote:
> On Tue, 12 Jul 2016, Allen Samuels wrote:
>> Good analysis.
>>
>> My original comments about putting the oNode on a diet included the idea
>> of a "custom" encode/decode path for certain high-usage cases. At the
>> time, Sage resisted going down that path hoping that a more optimized
>> generic case would get the job done. Your analysis shows that while
>> we've achieved significant space reduction this has come at the expense
>> of CPU time -- which dominates small object performance (I suspect that
>> eventually we'd discover that the variable length decode path would be
>> responsible for a substantial read performance degradation also -- which
>> may or may not be part of the read performance drop-off that you're
>> seeing). This isn't a surprising result, though it is unfortunate.
>>
>> I believe we need to revisit the idea of custom encode/decode paths for
>> high-usage cases, only now the gains need to be focused on CPU
>> utilization as well as space efficiency.
>
> I still think we can get most or all of the way there in a generic way by
> revising the way that we interact with bufferlist for encode and decode.
> We haven't actually tried to optimize this yet, and the current code is
> pretty horribly inefficient (asserts all over the place, and many layers
> of pointer indirection to do a simple append).  I think we need to do two
> things:
>
> 1) decode path: optimize the iterator class so that it has a const char
> *current and const char *current_end that point into the current
> buffer::ptr.  This way any decode will have a single pointer
> add+comparison to ensure there is enough data to copy before falling into
> the slow path (partial buffer, move to next buffer, etc.).
>

I don't have a good sense yet for how much this is hurting us in the 
read path.  We screwed something up in the last couple of weeks and 
small reads are quite slow.

> 2) Having that comparison is still not ideal, but we should consider ways
> to get around that too.  For example, if we know that we are going to
> decode N M-byte things, we could do an iterator 'reserve' or 'check' that
> ensures we have a valid pointer for that much and then proceed without
> checks.  The interface here would be tricky, though, since in the slow
> case we'll span buffers and need to magically fall back to a different
> decode path (hard to maintain) or do a temporary copy (probably faster but
> we need to ensure the iterator owns it and frees it later).  I'd say this
> is step 2 and optional; step 1 will have the most benefit.
>
> 3) encode path: currently all encode methods take a bufferlist& and the
> bufferlist itself as an append buffer.  I think this is flawed and
> limiting.  Instead, we should make a new class called
> buffer::list::appender (or similar) and templatize the encode methods so
> they can take a safe_appender (which does bounds checking) or an
> unsafe_appender (which does not).  For the latter, the user takes
> responsibility for making sure there is enough space by doing a reserve()
> type call which returns an unsafe_appender, and it's their job to make
> sure they don't shove too much data into it.  That should make the encode
> path a memcpy + ptr increment (for savvy/optimized callers).

Seems reasonable and similar in performance to what Piotr and I were 
discussing this morning.  As a very simple test I was thinking of doing 
a quick size computation and then passing that in to increase the 
append_buffer size when the bufferlist is created in 
Bluestore::_txc_write_nodes.  His idea went a bit farther to break the 
encapsulation, compute the fully encoded message, and dump it directly 
into a buffer of a computed size without the extra assert checks or 
bounds checking.  Obviously his idea would be faster but more work.

It sounds like your solution would be similar but a bit more formalized.

>
> I suggest we use bluestore as a test case to make the interfaces work and
> be fast.  If we succeed we can take advantage of it across the rest of
> the code base as well.

Do we have other places in the code with similar byte append behavior? 
That's what's really killing us I think, especially with how small the 
new append_buffer is when you run out of space when appending bytes.

>
> That's my thinking, at least.  I haven't had time to prototype it out yet,
> but I think our goal should be to make the encode/decode paths capable of
> being a memcpy + ptr addition in the fast path, and let that guide the
> interface...
>
> sage
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-13  3:13     ` Mark Nelson
@ 2016-07-13  6:33       ` Piotr Dałek
  2016-07-13 16:05         ` Sage Weil
  2016-07-14  5:52       ` Allen Samuels
  1 sibling, 1 reply; 39+ messages in thread
From: Piotr Dałek @ 2016-07-13  6:33 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, Allen Samuels, ceph-devel

On Tue, Jul 12, 2016 at 10:13:14PM -0500, Mark Nelson wrote:
> 
> 
> On 07/12/2016 08:50 PM, Sage Weil wrote:
> >On Tue, 12 Jul 2016, Allen Samuels wrote:
> >>[..]
> >>I believe we need to revisit the idea of custom encode/decode paths for
> >>high-usage cases, only now the gains need to be focused on CPU
> >>utilization as well as space efficiency.
> >
> >I still think we can get most or all of the way there in a generic way by
> >revising the way that we interact with bufferlist for encode and decode.
> >We haven't actually tried to optimize this yet, and the current code is
> >pretty horribly inefficient (asserts all over the place, and many layers
> >of pointer indirection to do a simple append).  I think we need to do two
> >things:
> >
> >1) decode path: optimize the iterator class so that it has a const char
> >*current and const char *current_end that point into the current
> >buffer::ptr.  This way any decode will have a single pointer
> >add+comparison to ensure there is enough data to copy before falling into
> >the slow path (partial buffer, move to next buffer, etc.).
> 
> I don't have a good sense yet for how much this is hurting us in the
> read path.  We screwed something up in the last couple of weeks and
> small reads are quite slow.

The main issue with decode using bufferlist is that we cannot assume
anything regarding internal data memory layout. We can't do anything like

 int j = *((int*) bufferptr.c_str());

because bufferptr may be too short for "j" to be read in one go.
A possible solution would be to ensure that bufferptrs contain contiguous
blocks large enough to store one unit of data (be it, for example, onode
info).
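
To make the problem concrete, here is a small sketch of a length-checked
read. read_int is a hypothetical helper, not part of bufferptr; it takes a
raw pointer and the number of bytes known to be valid behind it, and uses
memcpy so it makes no alignment assumptions about the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: read an int from [p, p + len) only if it fits.
// Returns false when the segment is too short for a whole int, which is
// exactly the case the plain pointer cast cannot detect.
inline bool read_int(const char* p, std::size_t len, int* out) {
  assert(out != nullptr);
  if (len < sizeof(int)) return false;
  std::memcpy(out, p, sizeof(int));
  return true;
}
```

If every bufferptr were guaranteed to hold at least one whole unit of data,
the length test could be hoisted out and done once per unit.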

> >2) Having that comparison is still not ideal, but we should consider ways
> >to get around that too.  For example, if we know that we are going to
> >decode N M-byte things, we could do an iterator 'reserve' or 'check' that
> >ensures we have a valid pointer for that much and then proceed without
> >checks.  The interface here would be tricky, though, since in the slow
> >case we'll span buffers and need to magically fall back to a different
> >decode path (hard to maintain) or do a temporary copy (probably faster but
> >we need to ensure the iterator owns it and frees it later).  I'd say this
> >is step 2 and optional; step 1 will have the most benefit.

Exactly my point.
Regarding a copy, we could just do something like rebuild_contiguous() and
make sure the bufferlist is one large bufferptr, or is split on logical
data unit boundaries instead of at random places that depend on the
messenger or the underlying I/O store. That will take care of both memory
ownership and bufferlist contiguity.

> >3) encode path: currently all encode methods take a bufferlist& and the
> >bufferlist itself as an append buffer.  I think this is flawed and
> >limiting.  Instead, we should make a new class called
> >buffer::list::appender (or similar) and templatize the encode methods so
> >they can take a safe_appender (which does bounds checking) or an
> >unsafe_appender (which does not).  For the latter, the user takes
> >responsibility for making sure there is enough space by doing a reserve()
> >type call which returns an unsafe_appender, and it's their job to make
> >sure they don't shove too much data into it.  That should make the encode
> >path a memcpy + ptr increment (for savvy/optimized callers).
> 
> Seems reasonable and similar in performance to what Piotr and I were
> discussing this morning.  As a very simple test I was thinking of
> doing a quick size computation and then passing that in to increase
> the append_buffer size when the bufferlist is created in
> Bluestore::_txc_write_nodes.  His idea went a bit farther to break
> the encapsulation, compute the fully encoded message, and dump it
> directly into a buffer of a computed size without the extra assert
> checks or bounds checking.  Obviously his idea would be faster but
> more work.
> 
> It sounds like your solution would be similar but a bit more formalized.

I like the idea, because that way we could add extra checks to debug builds
(added via preprocessor define) and have the ability to find bugs easier,
retaining performance on release/optimized builds.
Or we could go that way all-in, have a single kind of appender and do bounds
checks only on debug builds.
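
A sketch of the single-appender variant, assuming a hypothetical build flag
APPENDER_DEBUG and a hypothetical macro APPENDER_CHECK (neither exists in
the tree); the check compiles to nothing unless the flag is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical: bounds checks are only compiled in for debug builds.
#ifdef APPENDER_DEBUG
#define APPENDER_CHECK(cond) assert(cond)
#else
#define APPENDER_CHECK(cond) ((void)0)
#endif

class appender {
  char* pos_;
  char* end_;
public:
  appender(char* start, std::size_t cap) : pos_(start), end_(start + cap) {}

  void append(const void* p, std::size_t n) {
    // Present in debug builds, gone in release/optimized builds.
    APPENDER_CHECK(static_cast<std::size_t>(end_ - pos_) >= n);
    std::memcpy(pos_, p, n);
    pos_ += n;
  }

  std::size_t remaining() const {
    return static_cast<std::size_t>(end_ - pos_);
  }
};
```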

> >I suggest we use bluestore as a test case to make the interfaces work and
> >be fast.  If we succeed we can take advantage of it across the rest of
> >the code base as well.
> 
> Do we have other places in the code with similar byte append
> behavior? That's what's really killing us I think, especially with
> how small the new append_buffer is when you run out of space when
> appending bytes.

I still recommend doing a dry bench to measure how fast _txc_write_nodes
is. We may be spending a lot of time optimizing the bufferlist API when the
real issue lies elsewhere. Creation and destruction of bufferlists is
one of my points of concern in _txc_write_nodes, but then, I have no clue
how many times that happens per call.
If the following

   // (KeyValueDB::Transaction t, set<OnodeRef>::iterator p)
   t->set(PREFIX_OBJ, (*p)->key, bl);

performs synchronously (i.e. doesn't do anything else with bl contents after
leaving the t->set call), we could just rewind bl and start overwriting it. In
the worst case, we would allocate extra memory, which is still only one case
out of 3 possible (the remaining ones being a new onode/bnode of the same or
a smaller size than the previous one).

-- 
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-13  1:50   ` Sage Weil
  2016-07-13  3:13     ` Mark Nelson
@ 2016-07-13 14:47     ` Samuel Just
  1 sibling, 0 replies; 39+ messages in thread
From: Samuel Just @ 2016-07-13 14:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Mark Nelson, ceph-devel

I think that seems like a good way to go!
-Sam

On Tue, Jul 12, 2016 at 6:50 PM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 12 Jul 2016, Allen Samuels wrote:
>> Good analysis.
>>
>> My original comments about putting the oNode on a diet included the idea
>> of a "custom" encode/decode path for certain high-usage cases. At the
>> time, Sage resisted going down that path hoping that a more optimized
>> generic case would get the job done. Your analysis shows that while
>> we've achieved significant space reduction this has come at the expense
>> of CPU time -- which dominates small object performance (I suspect that
>> eventually we'd discover that the variable length decode path would be
>> responsible for a substantial read performance degradation also -- which
>> may or may not be part of the read performance drop-off that you're
>> seeing). This isn't a surprising result, though it is unfortunate.
>>
>> I believe we need to revisit the idea of custom encode/decode paths for
>> high-usage cases, only now the gains need to be focused on CPU
>> utilization as well as space efficiency.
>
> I still think we can get most or all of the way there in a generic way by
> revising the way that we interact with bufferlist for encode and decode.
> We haven't actually tried to optimize this yet, and the current code is
> pretty horribly inefficient (asserts all over the place, and many layers
> of pointer indirection to do a simple append).  I think we need to do two
> things:
>
> 1) decode path: optimize the iterator class so that it has a const char
> *current and const char *current_end that point into the current
> buffer::ptr.  This way any decode will have a single pointer
> add+comparison to ensure there is enough data to copy before falling into
> the slow path (partial buffer, move to next buffer, etc.).
>
> 2) Having that comparison is still not ideal, but we should consider ways
> to get around that too.  For example, if we know that we are going to
> decode N M-byte things, we could do an iterator 'reserve' or 'check' that
> ensures we have a valid pointer for that much and then proceed without
> checks.  The interface here would be tricky, though, since in the slow
> case we'll span buffers and need to magically fall back to a different
> decode path (hard to maintain) or do a temporary copy (probably faster but
> we need to ensure the iterator owns it and frees it later).  I'd say this
> is step 2 and optional; step 1 will have the most benefit.
>
> 3) encode path: currently all encode methods take a bufferlist& and the
> bufferlist itself as an append buffer.  I think this is flawed and
> limiting.  Instead, we should make a new class called
> buffer::list::appender (or similar) and templatize the encode methods so
> they can take a safe_appender (which does bounds checking) or an
> unsafe_appender (which does not).  For the latter, the user takes
> responsibility for making sure there is enough space by doing a reserve()
> type call which returns an unsafe_appender, and it's their job to make
> sure they don't shove too much data into it.  That should make the encode
> path a memcpy + ptr increment (for savvy/optimized callers).
>
> I suggest we use bluestore as a test case to make the interfaces work and
> be fast.  If we succeed we can take advantage of it across the rest of
> the code base as well.
>
> That's my thinking, at least.  I haven't had time to prototype it out yet,
> but I think our goal should be to make the encode/decode paths capable of
> being a memcpy + ptr addition in the fast path, and let that guide the
> interface...
>
> sage

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: bluestore onode diet and encoding overhead
  2016-07-13  6:33       ` Piotr Dałek
@ 2016-07-13 16:05         ` Sage Weil
  2016-07-13 21:29           ` Allen Samuels
  0 siblings, 1 reply; 39+ messages in thread
From: Sage Weil @ 2016-07-13 16:05 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: Mark Nelson, Allen Samuels, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6588 bytes --]

On Wed, 13 Jul 2016, Piotr Dałek wrote:
> On Tue, Jul 12, 2016 at 10:13:14PM -0500, Mark Nelson wrote:
> > 
> > 
> > On 07/12/2016 08:50 PM, Sage Weil wrote:
> > >On Tue, 12 Jul 2016, Allen Samuels wrote:
> > >>[..]
> > >>I believe we need to revisit the idea of custom encode/decode paths for
> > >>high-usage cases, only now the gains need to be focused on CPU
> > >>utilization as well as space efficiency.
> > >
> > >I still think we can get most or all of the way there in a generic way by
> > >revising the way that we interact with bufferlist for encode and decode.
> > >We haven't actually tried to optimize this yet, and the current code is
> > >pretty horribly inefficient (asserts all over the place, and many layers
> > >of pointer indirection to do a simple append).  I think we need to do two
> > >things:
> > >
> > >1) decode path: optimize the iterator class so that it has a const char
> > >*current and const char *current_end that point into the current
> > >buffer::ptr.  This way any decode will have a single pointer
> > >add+comparison to ensure there is enough data to copy before falling into
> > >the slow path (partial buffer, move to next buffer, etc.).
> > 
> > I don't have a good sense yet for how much this is hurting us in the
> > read path.  We screwed something up in the last couple of weeks and
> > small reads are quite slow.
> 
> The main issue with decode using bufferlist is that we cannot assume
> anything regarding internal data memory layout. We can't do anything like
> 
>  int j = *((int*) bufferptr.c_str());
> 
> because bufferptr may be too short for "j" to be read in one go.
> Possible solution would be to ensure that bufferptrs contain contiguous
> blocks large enough to store one unit of data (be it, for example, onode
> info).
> 
> > >2) Having that comparison is still not ideal, but we should consider ways
> > >to get around that too.  For example, if we know that we are going to
> > >decode N M-byte things, we could do an iterator 'reserve' or 'check' that
> > >ensures we have a valid pointer for that much and then proceed without
> > >checks.  The interface here would be tricky, though, since in the slow
> > >case we'll span buffers and need to magically fall back to a different
> > >decode path (hard to maintain) or do a temporary copy (probably faster but
> > >we need to ensure the iterator owns it and frees it later).  I'd say this
> > >is step 2 and optional; step 1 will have the most benefit.
> 
> Exactly my point.
> Regarding a copy, we could just do something like rebuild_contiguous() and
> make sure bufferlist is a one, large bufferptr or it is split on logical
> data unit boundary instead of random places that are messenger/underyling
> I/O store dependent. That will take care of both memory ownership and
> bufferlist continuity.

In practice, it is pretty much always contiguous: msgr allocates a whole 
chunk, and when we read stuff off disk or out of kv store it is one chunk.  
Pretty much the only time it isn't is when you just encoded it... and we 
generally don't decode in that case.

Anyway, the point is that we can do something pretty simple and 
non-optimal in the non-contiguous case (like rebuild()) and it 
shouldn't really matter.
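
Roughly, the "simple and non-optimal" fallback could look like the sketch
below. contiguous_view is a hypothetical helper and a vector of std::string
stands in for the bufferlist's segments; the common single-segment case
returns a pointer with no copy, and only the segmented case pays for a
rebuild into a scratch buffer owned by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: return a pointer to all bytes as one contiguous run.
// Copies into `scratch` only when the data is split across segments.
inline const char* contiguous_view(const std::vector<std::string>& segs,
                                   std::string& scratch,
                                   std::size_t& len) {
  if (segs.size() == 1) {  // common case: already contiguous, no copy
    len = segs[0].size();
    return segs[0].data();
  }
  std::size_t total = 0;
  for (const auto& s : segs) total += s.size();
  scratch.clear();
  scratch.reserve(total);  // one allocation for the rebuild
  for (const auto& s : segs) scratch += s;
  assert(scratch.size() == total);
  len = total;
  return scratch.data();
}
```

After this call a decoder can run over [p, p + len) with plain pointer
arithmetic, and the scratch buffer takes care of ownership.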

> > >3) encode path: currently all encode methods take a bufferlist& and the
> > >bufferlist itself as an append buffer.  I think this is flawed and
> > >limiting.  Instead, we should make a new class called
> > >buffer::list::appender (or similar) and templatize the encode methods so
> > >they can take a safe_appender (which does bounds checking) or an
> > >unsafe_appender (which does not).  For the latter, the user takes
> > >responsibility for making sure there is enough space by doing a reserve()
> > >type call which returns an unsafe_appender, and it's their job to make
> > >sure they don't shove too much data into it.  That should make the encode
> > >path a memcpy + ptr increment (for savvy/optimized callers).
> > 
> > Seems reasonable and similar in performance to what Piotr and I were
> > discussing this morning.  As a very simple test I was thinking of
> > doing a quick size computation and then passing that in to increase
> > the append_buffer size when the bufferlist is created in
> > Bluestore::_txc_write_nodes.  His idea went a bit farther to break
> > the encapsulation, compute the fully encoded message, and dump it
> > directly into a buffer of a computed size without the extra assert
> > checks or bounds checking.  Obviously his idea would be faster but
> > more work.
> > 
> > It sounds like your solution would be similar but a bit more formalized.
> 
> I like the idea, because that way we could add extra checks to debug builds
> (added via preprocessor define) and have the ability to find bugs easier,
> retaining performance on release/optimized builds.
> Or we could go that way all-in, have single kind of appender and do bounds
> check only on debug builds.

Yeah--I'd go for debug asserts that compile out of release builds.  (We 
should either create a new assert macro, or go do the work to change 
current assert()'s to ceph_assert_always() or whatever.)

> > >I suggest we use bluestore as a test case to make the interfaces work and
> > >be fast.  If we succeed we can take advantage of it across the rest of
> > >the code base as well.
> > 
> > Do we have other places in the code with similar byte append
> > behavior? That's what's really killing us I think, especially with
> > how small the new append_buffer is when you run out of space when
> > appending bytes.
> 
> I still recommend doing a dry bench and measure how fast _txc_write_nodes
> is. We may be spending a lot of time optimizing bufferlist API, when the
> real issue lies in another place. Creation and destruction of bufferlist is
> one of my point of concerns in _txc_write_nodes, but then, I have no clue on
> how many times that happen per entire call.
> If the following
> 
>    // (KeyValueDB::Transaction t, set<OnodeRef>::iterator p)
>    t->set(PREFIX_OBJ, (*p)->key, bl);
> 
> performs synchronously (i.e. doesn't do anything else with bl contents after
> leaving t->set call), we could just rewind bl and start overwriting it. In
> worst case, we would alloc extra memory, which is still one case out of 3
> possible (remaining being new onode/bnode of the same or smaller sizes
> than previous one).

Almost certain it does make a copy of the buffer.  Maybe a bufferlist 
method that clears ptr list but keeps one of them as the append buffer if 
it has no more references.  clear_keep_append() or something.
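
A rough sketch of the clear_keep_append() idea, using std::vector<char> as a stand-in for the append buffer. ReusableBuffer and copy_out are hypothetical names; copy_out only models the assumption made above that the key/value store copies the bytes during set().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a bufferlist with a clear_keep_append() method.
class ReusableBuffer {
 public:
  void append(const char* src, std::size_t n) {
    data_.insert(data_.end(), src, src + n);
  }
  // Drop the contents but keep the allocation for the next encode.
  void clear_keep_append() { data_.clear(); }
  std::size_t size() const { return data_.size(); }
  std::size_t capacity() const { return data_.capacity(); }
  // Models the store taking its own copy of the encoded bytes.
  std::string copy_out() const { return std::string(data_.begin(), data_.end()); }

 private:
  std::vector<char> data_;
};
```

Encoding a second onode of the same or smaller size into the cleared buffer then needs no new allocation; only a larger one forces the buffer to grow.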

sage


* RE: bluestore onode diet and encoding overhead
  2016-07-13 16:05         ` Sage Weil
@ 2016-07-13 21:29           ` Allen Samuels
  0 siblings, 0 replies; 39+ messages in thread
From: Allen Samuels @ 2016-07-13 21:29 UTC (permalink / raw)
  To: Sage Weil, Piotr Dałek; +Cc: Mark Nelson, ceph-devel

Just for labelling, I'll invent the names fast_encode and fast_decode as labels for the new scheme.

Fundamentally, the problem that we have is lack of knowledge about how much data is going to be encoded (and to a lesser extent, decoded). Currently, we basically check for space on each micro-encode operation. Clearly those checks have to be eliminated. That leaves us with basically three choices:

(1) Pre-allocate a buffer sufficiently large.
(2) Check for sufficient space at "strategic" places in the code.
(3) Fiddle with the CPU memory mapping tables to generate a page fault when we run off the end of our buffer (which clearly must be page aligned, blah blah blah) and then contrive to auto-allocate and restart.

I reject (3) as too complex and not worth the effort.

I reject (2) as difficult to maintain and difficult to determine where the "strategic" points in the code are that retain correctness, but happen infrequently enough to minimize CPU utilization. 

That leaves us with (1).

Method (1) must consist of making a worst-case prediction of the size of the encoded data, encoding the data into the allocated buffer, and then "freeing" the unused portion (since tight encoding of the data is content-dependent in length). The encode and free processes are relatively straightforward, but the prediction process has some options to explore.

(1a) Constant "maximum". We could easily establish something like a 128K or 256K constant upper limit and just use a buffer of that size.
(1b) Computed "maximum". For simple objects (a few fields/containers) it's relatively easy to add up the necessary sizes to generate an estimate. For complex objects, this is error prone because you're duplicating the encode logic in the estimate-size logic.

Right now, I'm trying to concoct some combination of C++ templates that lets me merge the estimate-size, encode and decode functions into a common routine so that we can avoid this systematic error. Stay tuned.

One of the dangers of the prediction scheme is what if the prediction is incorrect -- too small. Then you'll get buffer overrun. It's been suggested that we insert some special debug-assert code to detect that situation which is only enabled at compile-time. I believe this is NOT the right solution. The buffer overrun problem is data dependent. That means it will be especially hard to debug in the field as what you're looking for is essentially silent data corruption.

I believe that the best solution is to simply check the fully-encoded buffer against the estimate that was made. If the estimate is too small, then assert-out. Leave this in production code. If we encounter an encode-buffer overrun, at least we'll know that was the source of the problem and we can fix it. (I assert that if you see this, it'll be pretty obvious where it went wrong -- especially if I'm able to create a unified estimate-size, encode and decode function.)

One last problem to solve is the decode problem. We need to know how much data is in a fast-encode buffer in order to ensure that it's not fragmented in the buffer::list. This is relatively easy if the encode leaves a "total bytes" field at the start of the operation.

Now we can see the pseudo-code for fast-encode.

void fast_encode(object& o, bufferptr& b) {
   size_t estimate = o.estimate_sizeof_fast_encode() + sizeof(int);  // Extra int is for explicit size of the overall buffer.
   char * buf_start = b.push_back (estimate);	// returns pointer to a block of memory appended to the end that's of the specified size;
   char * buf_end  = o.do_fast_encode(buf_start + sizeof(int)); // starts serialization into the address pointed at by the input parameter, returns "next" pointer, i.e., pointer to next unused byte 
   size_t consumed = buf_end - buf_start; // Compute consumed bytes

   assert(consumed <= estimate);  // Here's where we catch a silent data corruption due to an overflow

   *(int *)buf_start = consumed;  // Total size of consumed buffer, including the starting int stored back at the start...

   size_t unused = estimate - consumed; // Amount of unused space at the end.

   b.pop_back(unused); // "free" the unused bytes at the end of the buffer.

}

Now we can see the shape of the fast_decode function.

void fast_decode(object& o, bufferptr& b) {

   int encode_size = b.pop_front_int();   // Remove first bytes where the encode left the size of the full buffer (that size includes the leading int itself)

   size_t payload_size = encode_size - sizeof(int);   // Bytes that follow the leading int

   const char *buf_start = b.contiguous_ptr(payload_size);   // return pointer to sequential buffer of specified number of bytes <<<<- here's where we might have to copy discontiguous buffers into a single buffer.

   const char *buf_end = o.do_fast_decode(buf_start);		              // Decode data from the buffer into the object, return the pointer to the next unconsumed byte.

   size_t consumed_bytes = buf_end - buf_start;

   assert(consumed_bytes == payload_size);			// Consistency check, we consumed exactly as much as we encoded, not counting the leading int that was already removed

   b.pop_front(consumed_bytes);				// logically remove the bytes that we've consumed

}
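
A runnable sketch of the same estimate / encode / verify / trim sequence, with std::vector<char> standing in for the buffer and a toy Obj type. Everything named here (Obj, its member functions, the uint32_t size prefix) is an assumption for illustration and not the bufferlist API. In this toy the estimate happens to be exact, so the trim step frees nothing; with a variable-length encoding it would give back the unused tail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Toy object with the three operations: estimate, encode, decode.
struct Obj {
  std::uint32_t id = 0;
  std::string name;

  std::size_t estimate() const {
    return sizeof(id) + sizeof(std::uint32_t) + name.size();
  }
  char* encode(char* p) const {
    std::memcpy(p, &id, sizeof(id)); p += sizeof(id);
    std::uint32_t len = static_cast<std::uint32_t>(name.size());
    std::memcpy(p, &len, sizeof(len)); p += sizeof(len);
    std::memcpy(p, name.data(), len);
    return p + len;  // pointer to the next unused byte
  }
  const char* decode(const char* p) {
    std::memcpy(&id, p, sizeof(id)); p += sizeof(id);
    std::uint32_t len;
    std::memcpy(&len, p, sizeof(len)); p += sizeof(len);
    name.assign(p, len);
    return p + len;  // pointer to the next unconsumed byte
  }
};

// Append o to buf: reserve the estimate, encode, verify, store size, trim.
inline void fast_encode(const Obj& o, std::vector<char>& buf) {
  const std::size_t estimate = o.estimate() + sizeof(std::uint32_t);
  const std::size_t start = buf.size();
  buf.resize(start + estimate);
  char* buf_start = buf.data() + start;
  char* buf_end = o.encode(buf_start + sizeof(std::uint32_t));
  const std::size_t consumed = static_cast<std::size_t>(buf_end - buf_start);
  if (consumed > estimate) std::abort();  // kept in release builds on purpose
  const std::uint32_t total = static_cast<std::uint32_t>(consumed);
  std::memcpy(buf_start, &total, sizeof(total));  // size prefix includes itself
  buf.resize(start + consumed);                   // "free" the unused tail
}

// Decode one object at offset off; returns the offset just past it.
// No validation of untrusted input is attempted in this sketch.
inline std::size_t fast_decode(Obj& o, const std::vector<char>& buf, std::size_t off) {
  std::uint32_t total;
  std::memcpy(&total, buf.data() + off, sizeof(total));
  const char* body = buf.data() + off + sizeof(total);
  const char* end = o.decode(body);
  if (static_cast<std::size_t>(end - body) + sizeof(total) != total) std::abort();
  return off + total;
}
```

The size prefix counts itself here, which is why the decode-side check adds sizeof(total) back before comparing.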

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, July 13, 2016 9:06 AM
> To: Piotr Dałek <branch@predictor.org.pl>
> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> <Allen.Samuels@sandisk.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> On Wed, 13 Jul 2016, Piotr Dałek wrote:
> > On Tue, Jul 12, 2016 at 10:13:14PM -0500, Mark Nelson wrote:
> > >
> > >
> > > On 07/12/2016 08:50 PM, Sage Weil wrote:
> > > >On Tue, 12 Jul 2016, Allen Samuels wrote:
> > > >>[..]
> > > >>I believe we need to revisit the idea of custom encode/decode
> > > >>paths for high-usage cases, only now the gains need to be focused
> > > >>on CPU utilization as well as space efficiency.
> > > >
> > > > >I still think we can get most or all of the way there in a generic
> > > > >way by revising the way that we interact with bufferlist for encode
> > > > >and decode.
> > > >We haven't actually tried to optimize this yet, and the current
> > > >code is pretty horribly inefficient (asserts all over the place,
> > > >and many layers of pointer indirection to do a simple append).  I
> > > >think we need to do two
> > > >things:
> > > >
> > > > >1) decode path: optimize the iterator class so that it has a const
> > > > >char *current and const char *current_end that point into the
> > > > >current buffer::ptr.  This way any decode will have a single
> > > > >pointer add+comparison to ensure there is enough data to copy
> > > > >before falling into the slow path (partial buffer, move to next
> > > > >buffer, etc.).
> > >
> > > I don't have a good sense yet for how much this is hurting us in the
> > > read path.  We screwed something up in the last couple of weeks and
> > > small reads are quite slow.
> >
> > The main issue with decode using bufferlist is that we cannot assume
> > anything regarding internal data memory layout. We can't do anything
> > like
> >
> >  int j = *((int*) bufferptr.c_str());
> >
> > because bufferptr may be too short for "j" to be read in one go.
> > Possible solution would be to ensure that bufferptrs contain
> > contiguous blocks large enough to store one unit of data (be it, for
> > example, onode info).
> >
> > > > >2) Having that comparison is still not ideal, but we should
> > > > >consider ways to get around that too.  For example, if we know that
> > > > >we are going to decode N M-byte things, we could do an iterator
> > > > >'reserve' or 'check' that ensures we have a valid pointer for that
> > > > >much and then proceed without checks.  The interface here would be
> > > > >tricky, though, since in the slow case we'll span buffers and need
> > > > >to magically fall back to a different decode path (hard to
> > > > >maintain) or do a temporary copy (probably faster but we need to
> > > > >ensure the iterator owns it and frees it later).  I'd say this is
> > > > >step 2 and optional; step 1 will have the most benefit.
> >
> > Exactly my point.
> > Regarding a copy, we could just do something like rebuild_contiguous()
> > and make sure bufferlist is one large bufferptr or it is split on
> > logical data unit boundary instead of random places that are
> > messenger/underlying I/O store dependent. That will take care of both
> > memory ownership and bufferlist continuity.
> 
> In practice, it is pretty much always contiguous: msgr allocates a whole chunk,
> and when we read stuff off disk or out of kv store it is one chunk.
> Pretty much the only time it isn't is when you just encoded it... and we
> generally don't decode in that case.
> 
> Anyway, the point is that we can do something pretty simple and non-
> optimal in the non-contiguous case (like rebuild()) and it shouldn't really
> matter.
> 
> > > >3) encode path: currently all encode methods take a bufferlist& and
> > > >the bufferlist itself as an append buffer.  I think this is flawed
> > > >and limiting.  Instead, we should make a new class called
> > > >buffer::list::appender (or similar) and templatize the encode
> > > >methods so they can take a safe_appender (which does bounds
> > > >checking) or an unsafe_appender (which does not).  For the latter,
> > > >the user takes responsibility for making sure there is enough space
> > > >by doing a reserve() type call which returns an unsafe_appender,
> > > >and it's their job to make sure they don't shove too much data into
> > > > >it.  That should make the encode path a memcpy + ptr increment (for
> > > > >savvy/optimized callers).
> > >
> > > Seems reasonable and similar in performance to what Piotr and I were
> > > discussing this morning.  As a very simple test I was thinking of
> > > doing a quick size computation and then passing that in to increase
> > > the append_buffer size when the bufferlist is created in
> > > Bluestore::_txc_write_nodes.  His idea went a bit farther to break
> > > the encapsulation, compute the fully encoded message, and dump it
> > > directly into a buffer of a computed size without the extra assert
> > > checks or bounds checking.  Obviously his idea would be faster but
> > > more work.
> > >
> > > It sounds like your solution would be similar but a bit more formalized.
> >
> > I like the idea, because that way we could add extra checks to debug
> > builds (added via preprocessor define) and have the ability to find
> > bugs easier, retaining performance on release/optimized builds.
> > Or we could go that way all-in, have single kind of appender and do
> > bounds check only on debug builds.
> 
> Yeah--I'd go for debug asserts that compile out of release builds.  (We should
> either create a new assert macro, or go do the work to change current
> assert()'s to ceph_assert_always() or whatever.)
> 
> > > >I suggest we use bluestore as a test case to make the interfaces
> > > >work and be fast.  If we succeed we can take advantage of it across
> > > > >the rest of the code base as well.
> > >
> > > Do we have other places in the code with similar byte append
> > > behavior? That's what's really killing us I think, especially with
> > > how small the new append_buffer is when you run out of space when
> > > appending bytes.
> >
> > I still recommend doing a dry bench and measure how fast
> > _txc_write_nodes is. We may be spending a lot of time optimizing
> > bufferlist API, when the real issue lies in another place. Creation
> > and destruction of bufferlist is one of my points of concern in
> > _txc_write_nodes, but then, I have no clue how many times that
> > happens per entire call.
> > If the following
> >
> >    // (KeyValueDB::Transaction t, set<OnodeRef>::iterator p)
> >    t->set(PREFIX_OBJ, (*p)->key, bl);
> >
> > performs synchronously (i.e. doesn't do anything else with bl contents
> > after leaving t->set call), we could just rewind bl and start
> > overwriting it. In worst case, we would alloc extra memory, which is
> > still one case out of 3 possible (remaining being new onode/bnode of
> > the same or smaller sizes than previous one).
> 
> Almost certain it does make a copy of the buffer.  Maybe a bufferlist method
> that clears ptr list but keeps one of them as the append buffer if it has no
> more references.  clear_keep_append() or something.
> 
> sage
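
The safe_appender / unsafe_appender split from point 3 of the quoted proposal might be sketched as follows. This is only an illustration under assumptions: the classes write into a std::vector<char> rather than a buffer::list, and reserve_unsafe and encode_pair are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical names throughout; this is not Ceph's buffer::list API.

// Checked appender: grows the backing vector on every append.
class safe_appender {
 public:
  explicit safe_appender(std::vector<char>& out) : out_(out) {}
  void append(const void* src, std::size_t n) {
    const char* s = static_cast<const char*>(src);
    out_.insert(out_.end(), s, s + n);
  }

 private:
  std::vector<char>& out_;
};

// Unchecked appender: the caller has already reserved enough space.
class unsafe_appender {
 public:
  explicit unsafe_appender(char* pos) : pos_(pos) {}
  void append(const void* src, std::size_t n) {
    std::memcpy(pos_, src, n);  // no bounds check: memcpy + pointer add
    pos_ += n;
  }
  char* pos() const { return pos_; }

 private:
  char* pos_;
};

// One encode routine, templated so it works with either appender.
template <typename Appender>
void encode_pair(std::uint32_t a, std::uint64_t b, Appender& ap) {
  ap.append(&a, sizeof(a));
  ap.append(&b, sizeof(b));
}

// reserve()-style helper: extend out by n bytes and return an unchecked
// appender positioned at the start of the new space.
inline unsafe_appender reserve_unsafe(std::vector<char>& out, std::size_t n) {
  const std::size_t start = out.size();
  out.resize(start + n);
  return unsafe_appender(out.data() + start);
}
```

Both paths produce identical bytes; the difference is only where the responsibility for the space check sits.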


* RE: bluestore onode diet and encoding overhead
  2016-07-13  3:13     ` Mark Nelson
  2016-07-13  6:33       ` Piotr Dałek
@ 2016-07-14  5:52       ` Allen Samuels
  2016-07-14 11:15         ` Mark Nelson
  1 sibling, 1 reply; 39+ messages in thread
From: Allen Samuels @ 2016-07-14  5:52 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil; +Cc: ceph-devel

As promised, here's some code that hacks out a new encode/decode framework. It has the advantage of only having to list the fields of a struct once and is pretty much guaranteed to never overrun a buffer....

Comments are requested :)


#include <iostream>
#include <fstream>
#include <set>
#include <string>
#include <string.h>

/*******************************************************

  
   New fast encode/decode framework.
  
   The entire framework is built around the idea that each object has three operations:
  
     ESTIMATE  -- worst-case estimate of the amount of storage required for this object
     ENCODE    -- encode object into buffer of size ESTIMATE
     DECODE    -- decode object from buffer of size actual.
  
   Each object has a single templated function that actually provides all three operations in a single set of code.
   By doing this, it's pretty much guaranteed that the ESTIMATE and the ENCODE code are in harmony (i.e. that the estimate is correct)
   it also saves a lot of typing/reading...
  
   Generally, all three operations are provided on a single function name with the input and return parameters overloaded to distinguish them.
  
   It's observed that for each of the three operations there is a single value which needs to be transmitted between each of the micro-encode/decode calls
   Yes, this is confusing, but let's look at a simple example
  
    struct simple {
      int a;
      float b;
      string c;
      set<int> d;
    };
  
    To encode this struct we generate a function that does the micro-encoding of each of the fields of the struct
    Here's an example of a function that does the ESTIMATE operation.
  
    size_t simple::estimate() {
       return 
          sizeof(a) +
          sizeof(b) +
          c.size() +
          d.size() * sizeof(int);
    }

    We're going to re-write it as:

    size_t simple::estimate(size_t p) {
       p = estimate(p,a);
       p = estimate(p,b);
       p = estimate(p,c);
       p = estimate(p,d);
       return p;
    }

    assuming helper functions along the lines of:

    template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
    template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
    

    similarly, the encode operation is represented as:

    char * simple::encode(char *p) {
       p = encode(p,a);
       p = encode(p,b);
       p = encode(p,c);
       p = encode(p,d);
       return p;
    }
       
    similarly, the decode operation is represented as:

    const char * simple::decode(const char *p) {
       p = decode(p,a);
       p = decode(p,b);
       p = decode(p,c);
       p = decode(p,d);
       return p;
    }
       

You can now see that it's possible to create a single function that does all three operations in a single block
of code, provided that you can fiddle the input/output parameter types appropriately.

In essence the pattern is

    p = enc_dec(p,struct_field_1);
    p = enc_dec(p,struct_field_2);
    p = enc_dec(p,struct_field_3);

With the type of p being set differently for each operation, i.e.,
    for ESTIMATE, p = size_t
    for ENCODE,   p = char *
    for DECODE,   p = const char *

This is the essence of how the encode/decode framework operates. Though there is some more sophistication...

----------------------

We also want to allow the encode/decode machinery to be per-type and to operate 

*****************************************************************************/

using namespace std;

//
// Just like the existing encode/decode machinery. The environment provides a rich set of 
// pre-defined encodes for primitive types and containers
//

#define DEFINE_ENC_DEC_RAW(type) \
inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }

DEFINE_ENC_DEC_RAW(int);
DEFINE_ENC_DEC_RAW(size_t);

//
// String encode/decode (Yea, I know size_t isn't portable -- this is an EXAMPLE man...)
//
inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
inline char * enc_dec(char * p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(const size_t *)p); return p + sizeof(size_t) + s.size(); }

//
// Let's do a container.
//
// One of the problems with a container is that making an accurate estimate of the size
// would theoretically require that you walk the entire container and add up the sizes of each element.
// We probably don't want to do that. So here, I do a hack that just assumes that I can fake up an individual element
// and multiply that by the number of elements in a container. This hack works anytime that the estimate function
// for the contained type has a fixed maximum size. BTW, this is safe: if the contained type has a variable size
//  (like set<string>) then it will fault out the first time you run it.
//
// Naturally, something like set<string> or map<string,string> is a highly desirable thing to be able to encode/decode
// there's no reason that you can't create a enc_dec_slow function that properly computes the maximum size by walking the container.
//
template<typename t>
inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }

template<typename t>
inline char *enc_dec(char *p,set<t>& s) {
   size_t sz = s.size();
   p = enc_dec(p,sz);
   for (const t& e : s) {
      p = enc_dec(p,const_cast<t&>(e));
   }
   return p;
}

template<typename t>
inline const char *enc_dec(const char *p,set<t>&s) {
   size_t sz;
   p = enc_dec(p,sz);
   while (sz--) {
      t temp;
      p = enc_dec(p,temp);
      s.insert(temp);
   }
   return p;
}

//
// Specialized encode/decode for a single data type. These are invoked explicitly...
//
inline size_t enc_dec_lba(size_t p,int& lba) {
   return p + sizeof(lba); // Max....
}

inline char * enc_dec_lba(char *p,int& lba) {
   *p = 15;
   return p + 1; // blah blah
}

inline const char *enc_dec_lba(const char *p,int& lba) {
   lba = *p;
   return p+1;
}

//
// Specialized encode/decode for more sophisticated primitives.
//
// Here's an example of an encoder/decoder for a pair of fields
//
inline size_t enc_dec_range(size_t p,short& start,short& end) {
   return p + 2 * sizeof(short);
}

inline char *enc_dec_range(char *p, short& start, short& end) {
   short *s = (short *) p;
   s[0] = start;
   s[1] = end;
   return p + sizeof(short) * 2;
}

inline const char *enc_dec_range(const char *p,short& start, short& end) {
   start = *(short *)p;
   end   = *(short *)(p + sizeof(short));
   return p + 2*sizeof(short);
}


//
// Some C++ template wizardry to make the single encode/decode function possible.
//
enum SERIAL_TYPE {
   ESTIMATE,
   ENCODE,
   DECODE
};

template <enum SERIAL_TYPE s> struct serial_type;

template<> struct serial_type<ESTIMATE> { typedef size_t type; };
template<> struct serial_type<ENCODE>   { typedef char * type; };
template<> struct serial_type<DECODE>   { typedef const char *type; };

//
// This macro is the key, it connects the external non-member function to the correct member function.
//
#define DEFINE_STRUCT_ENC_DEC(s) \
inline size_t      enc_dec(size_t p, s &o) { return o.enc_dec<ESTIMATE>(p); } \
inline char *      enc_dec(char *p , s &o)  { return o.enc_dec<ENCODE>(p); } \
inline const char *enc_dec(const char *p,s &o)  { return o.enc_dec<DECODE>(p); }

//
// Our example structure
//
struct astruct {
   int a;
   set<int> b;
   int lba;
   short start,end;

   //
   // <<<<< You need to provide this function just once.
   //
   template<enum SERIAL_TYPE s> typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
      p = ::enc_dec(p,a);
      p = ::enc_dec(p,b);
      p = ::enc_dec_lba(p,lba);
      p = ::enc_dec_range(p,start,end);
      return p;
   }
};

//
// This macro connects the global enc_dec to the member function.
// One of these per struct declaration
//
DEFINE_STRUCT_ENC_DEC(astruct);


//
// Here's a simple test program. The real encode/decode framework needs to be connected to bufferlist using the pseudo-code
// that I documented in my previous email.
//

int main(int argc,char **argv) {

   astruct a;
   a.a = 10;
   a.b.insert(2);
   a.b.insert(3);
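   a.lba = 12;
   a.start = 1;   // initialize the range fields so the encoder does not read indeterminate values
   a.end = 2;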

   size_t s = a.enc_dec<ESTIMATE>(size_t(0));
   cout << "Estimated size is " << s << "\n";

   char buffer[100];

   char *end = a.enc_dec<ENCODE>(buffer);

   cout << "Actual storage was " << end-buffer << "\n";

   astruct b;

   (void) b.enc_dec<DECODE>(buffer); // decode it
    
   cout << "b.a = " << b.a << "\n";
   for (auto e : b.b) {
      cout << " " << e;
   }

   cout << "\n";

   cout << "b.lba = " << b.lba << "\n";
   
   return 0;
}
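
The comments above mention that a variable-size container such as set<string> would need an enc_dec_slow that walks the container. A possible shape for it, following the same three-overload convention, is sketched below. This is an assumption about how such a function could look and not part of the posted code; it uses memcpy instead of pointer casts to sidestep alignment concerns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <set>
#include <string>

// ESTIMATE: exact size, obtained by walking every element (O(n)).
inline std::size_t enc_dec_slow(std::size_t p, std::set<std::string>& s) {
  p += sizeof(std::size_t);  // element count
  for (const std::string& e : s) p += sizeof(std::size_t) + e.size();
  return p;
}

// ENCODE: element count, then a length-prefixed copy of each string.
inline char* enc_dec_slow(char* p, std::set<std::string>& s) {
  std::size_t n = s.size();
  std::memcpy(p, &n, sizeof(n)); p += sizeof(n);
  for (const std::string& e : s) {
    std::size_t len = e.size();
    std::memcpy(p, &len, sizeof(len)); p += sizeof(len);
    std::memcpy(p, e.data(), len); p += len;
  }
  return p;
}

// DECODE: mirror image of ENCODE.
inline const char* enc_dec_slow(const char* p, std::set<std::string>& s) {
  std::size_t n;
  std::memcpy(&n, p, sizeof(n)); p += sizeof(n);
  while (n--) {
    std::size_t len;
    std::memcpy(&len, p, sizeof(len)); p += sizeof(len);
    s.emplace(p, len);
    p += len;
  }
  return p;
}
```

All three overloads take a non-const reference, as in the posted code, so that overload resolution is decided by the type of p alone.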


Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, July 12, 2016 8:13 PM
> To: Sage Weil <sweil@redhat.com>; Allen Samuels
> <Allen.Samuels@sandisk.com>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> 
> 
> On 07/12/2016 08:50 PM, Sage Weil wrote:
> > On Tue, 12 Jul 2016, Allen Samuels wrote:
> >> Good analysis.
> >>
> >> My original comments about putting the oNode on a diet included the
> >> idea of a "custom" encode/decode path for certain high-usage cases.
> >> At the time, Sage resisted going down that path hoping that a more
> >> optimized generic case would get the job done. Your analysis shows
> >> that while we've achieved significant space reduction this has come
> >> at the expense of CPU time -- which dominates small object
> >> performance (I suspect that eventually we'd discover that the
> >> variable length decode path would be responsible for a substantial
> >> read performance degradation also -- which may or may not be part of
> >> the read performance drop-off that you're seeing). This isn't a
> >> surprising result, though it is unfortunate.
> >>
> >> I believe we need to revisit the idea of custom encode/decode paths
> >> for high-usage cases, only now the gains need to be focused on CPU
> >> utilization as well as space efficiency.
> >
> > I still think we can get most or all of the way there in a generic way
> > by revising the way that we interact with bufferlist for encode and decode.
> > We haven't actually tried to optimize this yet, and the current code
> > is pretty horribly inefficient (asserts all over the place, and many
> > layers of pointer indirection to do a simple append).  I think we need
> > to do two
> > things:
> >
> > 1) decode path: optimize the iterator class so that it has a const
> > char *current and const char *current_end that point into the current
> > buffer::ptr.  This way any decode will have a single pointer
> > add+comparison to ensure there is enough data to copy before falling
> > into the slow path (partial buffer, move to next buffer, etc.).
> >
> 
> I don't have a good sense yet for how much this is hurting us in the read
> path.  We screwed something up in the last couple of weeks and small reads
> are quite slow.
> 
> > 2) Having that comparison is still not ideal, but we should consider
> > ways to get around that too.  For example, if we know that we are
> > going to decode N M-byte things, we could do an iterator 'reserve' or
> > 'check' that ensures we have a valid pointer for that much and then
> > proceed without checks.  The interface here would be tricky, though,
> > since in the slow case we'll span buffers and need to magically fall
> > back to a different decode path (hard to maintain) or do a temporary
> > copy (probably faster but we need to ensure the iterator owns it and
> > frees it later).  I'd say this is step 2 and optional; step 1 will
> > have the most benefit.
> >
> > 3) encode path: currently all encode methods take a bufferlist& and
> > the bufferlist itself as an append buffer.  I think this is flawed and
> > limiting.  Instead, we should make a new class called
> > buffer::list::appender (or similar) and templatize the encode methods
> > so they can take a safe_appender (which does bounds checking) or an
> > unsafe_appender (which does not).  For the latter, the user takes
> > responsibility for making sure there is enough space by doing a
> > reserve() type call which returns an unsafe_appender, and it's their
> > job to make sure they don't shove too much data into it.  That should
> > make the encode path a memcpy + ptr increment (for savvy/optimized
> > callers).
> 
> Seems reasonable and similar in performance to what Piotr and I were
> discussing this morning.  As a very simple test I was thinking of doing a quick
> size computation and then passing that in to increase the append_buffer size
> when the bufferlist is created in Bluestore::_txc_write_nodes.  His idea went
> a bit farther to break the encapsulation, compute the fully encoded
> message, and dump it directly into a buffer of a computed size without the
> extra assert checks or bounds checking.  Obviously his idea would be faster
> but more work.
> 
> It sounds like your solution would be similar but a bit more formalized.
> 
> >
> > I suggest we use bluestore as a test case to make the interfaces work
> > and be fast.  If we succeed we can take advantage of it across the
> > rest of the code base as well.
> 
> Do we have other places in the code with similar byte append behavior?
> That's what's really killing us I think, especially with how small the new
> append_buffer is when you run out of space when appending bytes.
> 
> >
> > That's my thinking, at least.  I haven't had time to prototype it out
> > yet, but I think our goal should be to make the encode/decode paths
> > capable of being a memcpy + ptr addition in the fast path, and let
> > that guide the interface...
> >
> > sage
> >
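
Point 1 of the quoted proposal (a decode iterator holding current and current_end pointers into the current buffer) could look roughly like this. decode_cursor is a hypothetical class, and a vector of byte vectors stands in for the list of buffer::ptr segments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical decode cursor over a list of segments.
class decode_cursor {
 public:
  explicit decode_cursor(const std::vector<std::vector<char>>& segs) : segs_(segs) {
    load();
  }

  void copy(void* dst, std::size_t n) {
    if (static_cast<std::size_t>(current_end_ - current_) >= n) {
      // Fast path: one comparison, one memcpy, one pointer add.
      std::memcpy(dst, current_, n);
      current_ += n;
      return;
    }
    copy_slow(static_cast<char*>(dst), n);
  }

 private:
  // Point current_/current_end_ at segment idx_, or at nothing if exhausted.
  void load() {
    if (idx_ < segs_.size()) {
      current_ = segs_[idx_].data();
      current_end_ = current_ + segs_[idx_].size();
    } else {
      current_ = current_end_ = nullptr;
    }
  }

  // Slow path: the value spans segments, so copy it piece by piece.
  void copy_slow(char* dst, std::size_t n) {
    while (n > 0) {
      if (current_ == current_end_) {
        if (idx_ >= segs_.size()) std::abort();  // ran past the end of the data
        ++idx_;
        load();
        continue;
      }
      std::size_t avail = static_cast<std::size_t>(current_end_ - current_);
      std::size_t take = avail < n ? avail : n;
      std::memcpy(dst, current_, take);
      current_ += take;
      dst += take;
      n -= take;
    }
  }

  const std::vector<std::vector<char>>& segs_;
  std::size_t idx_ = 0;
  const char* current_ = nullptr;
  const char* current_end_ = nullptr;
};
```

If the data is almost always contiguous, as suggested earlier in the thread, nearly every copy() call takes the fast path.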


* Re: bluestore onode diet and encoding overhead
  2016-07-14  5:52       ` Allen Samuels
@ 2016-07-14 11:15         ` Mark Nelson
  2016-07-14 14:10           ` Allen Samuels
                             ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Mark Nelson @ 2016-07-14 11:15 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel

On 07/14/2016 12:52 AM, Allen Samuels wrote:
> As promised, here's some code that hacks out a new encode/decode framework. It has the advantage of only having to list the fields of a struct once and is pretty much guaranteed to never overrun a buffer....
>
> Comments are requested :)

It compiles! :D

I looked over the code, but I want to look it over again after I've had 
my coffee since I'm still shaking the cobwebs out.  Would the idea here 
be that if you are doing varint encoding for example that you always 
allocate the buffer based on ESTIMATE (also taking into account the 
encoding overhead), but typically expect a much smaller encoding?

As it is, it's very clever.

Mark
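
To make the varint question concrete, here is one way the three overloads could look for a variable-length integer, assuming a 7-bits-per-byte little-endian encoding. enc_dec_varint and kMaxVarint64 are hypothetical names and not part of the posted framework. ESTIMATE always reports the 10-byte worst case, while ENCODE usually writes far fewer bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Worst case for a 64-bit value at 7 payload bits per byte: ceil(64/7) = 10.
constexpr std::size_t kMaxVarint64 = 10;

// ESTIMATE: always the worst case, independent of the value.
inline std::size_t enc_dec_varint(std::size_t p, std::uint64_t&) {
  return p + kMaxVarint64;
}

// ENCODE: low 7 bits first; the high bit of each byte marks "more follows".
inline char* enc_dec_varint(char* p, std::uint64_t& v) {
  std::uint64_t x = v;
  while (x >= 0x80) {
    *p++ = static_cast<char>((x & 0x7f) | 0x80);
    x >>= 7;
  }
  *p++ = static_cast<char>(x);
  return p;
}

// DECODE: accumulate 7 bits per byte until a byte without the high bit.
// Malformed input is not validated in this sketch.
inline const char* enc_dec_varint(const char* p, std::uint64_t& v) {
  v = 0;
  int shift = 0;
  unsigned char b;
  do {
    b = static_cast<unsigned char>(*p++);
    v |= static_cast<std::uint64_t>(b & 0x7f) << shift;
    shift += 7;
  } while (b & 0x80);
  return p;
}
```

With this shape, the buffer is sized from the worst case and the difference between the estimate and the bytes actually written is what gets trimmed after encoding.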

>
>
> #include <iostream>
> #include <fstream>
> #include <set>
> #include <string>
> #include <string.h>
>
> /*******************************************************
>
>
>    New fast encode/decode framework.
>
>    The entire framework is built around the idea that each object has three operations:
>
>      ESTIMATE  -- worst-case estimate of the amount of storage required for this object
>      ENCODE    -- encode object into buffer of size ESTIMATE
>      DECODE    -- decode object from buffer of size actual.
>
>    Each object has a single templated function that actually provides all three operations in a single set of code.
>    By doing this, it's pretty much guaranteed that the ESTIMATE and the ENCODE code are in harmony (i.e. that the estimate is correct)
>    it also saves a lot of typing/reading...
>
>    Generally, all three operations are provided on a single function name with the input and return parameters overloaded to distinguish them.
>
>    It's observed that for each of the three operations there is a single value which needs to be transmitted between each of the micro-encode/decode calls
>    Yes, this is confusing, but let's look at a simple example
>
>     struct simple {
>       int a;
>       float b;
>       string c;
>       set<int> d;
>     };
>
>     To encode this struct we generate a function that does the micro-encoding of each of the fields of the struct
>     Here's an example of a function that does the ESTIMATE operation.
>
>     size_t simple::estimate() {
>        return
>           sizeof(a) +
>           sizeof(b) +
>           c.size() +
>           d.size() * sizeof(int);
>     }
>
>     We're going to re-write it as:
>
>     size_t simple::estimate(size_t p) {
>        p = estimate(p,a);
>        p = estimate(p,b);
>        p = estimate(p,c);
>        p = estimate(p,d);
>        return p;
>     }
>
>     assuming helper functions along the lines of:
>
>     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
>     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
>
>
>     similarly, the encode operation is represented as:
>
>     char * simple::encode(char *p) {
>        p = encode(p,a);
>        p = encode(p,b);
>        p = encode(p,c);
>        p = encode(p,d);
>        return p;
>     }
>
>     similarly, the decode operation is represented as:
>
>     const char * simple::decode(const char *p) {
>        p = decode(p,a);
>        p = decode(p,b);
>        p = decode(p,c);
>        p = decode(p,d);
>        return p;
>     }
>
>
> You can now see that it's possible to create a single function that does all three operations in a single block
> of code, provided that you can fiddle the input/output parameter types appropriately.
>
> In essence the pattern is
>
>     p = enc_dec(p,struct_field_1);
>     p = enc_dec(p,struct_field_2);
>     p = enc_dec(p,struct_field_3);
>
> With the type of p being set differently for each operation, i.e.,
>     for ESTIMATE, p = size_t
>     for ENCODE,   p = char *
>     for DECODE,   p = const char *
>
> This is the essence of how the encode/decode framework operates. Though there is some more sophistication...
>
> ----------------------
>
> We also want to allow the encode/decode machinery to be per-type and to operate
>
> *****************************************************************************/
>
> using namespace std;
>
> //
> // Just like the existing encode/decode machinery. The environment provides a rich set of
> // pre-defined encodes for primitive types and containers
> //
>
> #define DEFINE_ENC_DEC_RAW(type) \
> inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
> inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }
>
> DEFINE_ENC_DEC_RAW(int);
> DEFINE_ENC_DEC_RAW(size_t);
>
> //
> // String encode/decode (Yea, I know size_t isn't portable -- this is an EXAMPLE man...)
> //
> inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
> inline char * enc_dec(char * p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
> inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); }
>
> //
> // Let's do a container.
> //
> // One of the problems with a container is that making an accurate estimate of the size
> // would theoretically require that you walk the entire container and add up the sizes of each element.
> // We probably don't want to do that. So here, I do a hack that just assumes that I can fake up an individual element
> // and multiply that by the number of elements in a container. This hack works anytime that the estimate function
> // for the contained type has a fixed maximum size. BTW, this is safe: if the contained type has a variable size
> //  (like set<string>) then it will fault out the first time you run it.
> //
> // Naturally, something like set<string> or map<string,string> is a highly desirable thing to be able to encode/decode;
> // there's no reason that you can't create an enc_dec_slow function that properly computes the maximum size by walking the container.
> //
> template<typename t>
> inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
>
> template<typename t>
> inline char *enc_dec(char *p,set<t>& s) {
>    size_t sz = s.size();
>    p = enc_dec(p,sz);
>    for (const t& e : s) {
>       p = enc_dec(p,const_cast<t&>(e));
>    }
>    return p;
> }
>
> template<typename t>
> inline const char *enc_dec(const char *p,set<t>&s) {
>    size_t sz;
>    p = enc_dec(p,sz);
>    while (sz--) {
>       t temp;
>       p = enc_dec(p,temp);
>       s.insert(temp);
>    }
>    return p;
> }
>
> //
> // Specialized encode/decode for a single data type. These are invoked explicitly...
> //
> inline size_t enc_dec_lba(size_t p,int& lba) {
>    return p + sizeof(lba); // Max....
> }
>
> inline char * enc_dec_lba(char *p,int& lba) {
>    *p = 15;
>    return p + 1; // blah blah
> }
>
> inline const char *enc_dec_lba(const char *p,int& lba) {
>    lba = *p;
>    return p+1;
> }
>
> //
> // Specialized encode/decode for more sophisticated primitives.
> //
> // Here's an example of an encoder/decoder for a pair of fields
> //
> inline size_t enc_dec_range(size_t p,short& start,short& end) {
>    return p + 2 * sizeof(short);
> }
>
> inline char *enc_dec_range(char *p, short& start, short& end) {
>    short *s = (short *) p;
>    s[0] = start;
>    s[1] = end;
>    return p + sizeof(short) * 2;
> }
>
> inline const char *enc_dec_range(const char *p,short& start, short& end) {
>    start = *(short *)p;
>    end   = *(short *)(p + sizeof(short));
>    return p + 2*sizeof(short);
> }
>
>
> //
> // Some C++ template wizardry to make the single encode/decode function possible.
> //
> enum SERIAL_TYPE {
>    ESTIMATE,
>    ENCODE,
>    DECODE
> };
>
> template <enum SERIAL_TYPE s> struct serial_type;
>
> template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> template<> struct serial_type<ENCODE>   { typedef char * type; };
> template<> struct serial_type<DECODE>   { typedef const char *type; };
>
> //
> // This macro is the key, it connects the external non-member function to the correct member function.
> //
> #define DEFINE_STRUCT_ENC_DEC(s) \
> inline size_t      enc_dec(size_t p, s &o) { return o.enc_dec<ESTIMATE>(p); } \
> inline char *      enc_dec(char *p , s &o)  { return o.enc_dec<ENCODE>(p); } \
> inline const char *enc_dec(const char *p,s &o)  { return o.enc_dec<DECODE>(p); }
>
> //
> // Our example structure
> //
> struct astruct {
>    int a;
>    set<int> b;
>    int lba;
>    short start,end;
>
>    //
>    // <<<<< You need to provide this function just once.
>    //
>    template<enum SERIAL_TYPE s> typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
>       p = ::enc_dec(p,a);
>       p = ::enc_dec(p,b);
>       p = ::enc_dec_lba(p,lba);
>       p = ::enc_dec_range(p,start,end);
>       return p;
>    }
> };
>
> //
> // This macro connects the global enc_dec to the member function.
> // One of these per struct declaration
> //
> DEFINE_STRUCT_ENC_DEC(astruct);
>
>
> //
> // Here's a simple test program. The real encode/decode framework needs to be connected to bufferlist using the pseudo-code
> // that I documented in my previous email.
> //
>
> int main(int argc,char **argv) {
>
>    astruct a;
>    a.a = 10;
>    a.b.insert(2);
>    a.b.insert(3);
>    a.lba = 12;
>
>    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
>    cout << "Estimated size is " << s << "\n";
>
>    char buffer[100];
>
>    char *end = a.enc_dec<ENCODE>(buffer);
>
>    cout << "Actual storage was " << end-buffer << "\n";
>
>    astruct b;
>
>    (void) b.enc_dec<DECODE>(buffer); // decode it
>
>    cout << "A.a = " << b.a << "\n";
>    for (auto e : b.b) {
>       cout << " " << e;
>    }
>
>    cout << "\n";
>
>    cout << "a.lba = " << b.lba << "\n";
>
>    return 0;
> }
>
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Tuesday, July 12, 2016 8:13 PM
>> To: Sage Weil <sweil@redhat.com>; Allen Samuels
>> <Allen.Samuels@sandisk.com>
>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: Re: bluestore onode diet and encoding overhead
>>
>>
>>
>> On 07/12/2016 08:50 PM, Sage Weil wrote:
>>> On Tue, 12 Jul 2016, Allen Samuels wrote:
>>>> Good analysis.
>>>>
>>>> My original comments about putting the oNode on a diet included the
>>>> idea of a "custom" encode/decode path for certain high-usage cases.
>>>> At the time, Sage resisted going down that path hoping that a more
>>>> optimized generic case would get the job done. Your analysis shows
>>>> that while we've achieved significant space reduction this has come
>>>> at the expense of CPU time -- which dominates small object
>>>> performance (I suspect that eventually we'd discover that the
>>>> variable length decode path would be responsible for a substantial
>>>> read performance degradation also -- which may or may not be part of
>>>> the read performance drop-off that you're seeing). This isn't a
>>>> surprising result, though it is unfortunate.
>>>>
>>>> I believe we need to revisit the idea of custom encode/decode paths
>>>> for high-usage cases, only now the gains need to be focused on CPU
>>>> utilization as well as space efficiency.
>>>
>>> I still think we can get most or all of the way there in a generic way
>>> by revising the way that we interact with bufferlist for encode and decode.
>>> We haven't actually tried to optimize this yet, and the current code
>>> is pretty horribly inefficient (asserts all over the place, and many
>>> layers of pointer indirection to do a simple append).  I think we need
>>> to do two
>>> things:
>>>
>>> 1) decode path: optimize the iterator class so that it has a const
>>> char *current and const char *current_end that point into the current
>>> buffer::ptr.  This way any decode will have a single pointer
>>> add+comparison to ensure there is enough data to copy before falling
>>> into the slow path (partial buffer, move to next buffer, etc.).
>>>
>>
>> I don't have a good sense yet for how much this is hurting us in the read
>> path.  We screwed something up in the last couple of weeks and small reads
>> are quite slow.
>>
>>> 2) Having that comparison is still not ideal, but we should consider
>>> ways to get around that too.  For example, if we know that we are
>>> going to decode N M-byte things, we could do an iterator 'reserve' or
>>> 'check' that ensures we have a valid pointer for that much and then
>>> proceed without checks.  The interface here would be tricky, though,
>>> since in the slow case we'll span buffers and need to magically fall
>>> back to a different decode path (hard to maintain) or do a temporary
>>> copy (probably faster but we need to ensure the iterator owns it and
>>> frees it later).  I'd say this is step 2 and optional; step 1 will have the
>>> most benefit.
>>>
>>> 3) encode path: currently all encode methods take a bufferlist& and
>>> the bufferlist itself as an append buffer.  I think this is flawed and
>>> limiting.  Instead, we should make a new class called
>>> buffer::list::appender (or similar) and templatize the encode methods
>>> so they can take a safe_appender (which does bounds checking) or an
>>> unsafe_appender (which does not).  For the latter, the user takes
>>> responsibility for making sure there is enough space by doing a
>>> reserve() type call which returns an unsafe_appender, and it's their
>>> job to make sure they don't shove too much data into it.  That should
>>> make the encode path a memcpy + ptr increment (for savvy/optimized
>>> callers).
>>
>> Seems reasonable and similar in performance to what Piotr and I were
>> discussing this morning.  As a very simple test I was thinking of doing a quick
>> size computation and then passing that in to increase the append_buffer size
>> when the bufferlist is created in Bluestore::_txc_write_nodes.  His idea went
>> a bit farther to break the encapsulation, compute the fully encoded
>> message, and dump it directly into a buffer of a computed size without the
>> extra assert checks or bounds checking.  Obviously his idea would be faster
>> but more work.
>>
>> It sounds like your solution would be similar but a bit more formalized.
>>
>>>
>>> I suggest we use bluestore as a test case to make the interfaces work
>>> and be fast.  If we succeed we can take advantage of it across the
>>> rest of the code base as well.
>>
>> Do we have other places in the code with similar byte append behavior?
>> That's what's really killing us I think, especially with how small the new
>> append_buffer is when you run out of space when appending bytes.
>>
>>>
>>> That's my thinking, at least.  I haven't had time to prototype it out
>>> yet, but I think our goal should be to make the encode/decode paths
>>> capable of being a memcpy + ptr addition in the fast path, and let
>>> that guide the interface...
>>>
>>> sage

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-14 11:15         ` Mark Nelson
@ 2016-07-14 14:10           ` Allen Samuels
  2016-08-12 16:18             ` Sage Weil
  2016-07-14 14:14           ` Allen Samuels
  2016-07-14 16:20           ` Allen Samuels
  2 siblings, 1 reply; 39+ messages in thread
From: Allen Samuels @ 2016-07-14 14:10 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil; +Cc: ceph-devel

Yes, I did actually run the code before I posted it.

w.r.t. varint encoding: you have two choices for a variable-length encoding. You could examine the data to accurately predict the output size, OR you could just return a constant that represents the worst-case (max) size.  For individual fields, it probably doesn't matter which you choose, but for fields that are part of something in a container, you probably want the option of NOT running down the container to size up each element -- so you'd just choose the worst-case size for the estimator.
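
As an illustration of the worst-case option, here is a minimal sketch of a varint micro-encoder written in the same three-overload style. The name enc_dec_varint, the constant VARINT_U64_MAX and the 7-bits-per-byte wire format are assumptions made for this example; they are not part of the posted framework.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical varint format (an assumption for this sketch): 7 payload bits
// per byte, least-significant group first, high bit set when more bytes follow.
static const size_t VARINT_U64_MAX = 10;  // ceil(64 / 7) bytes in the worst case

// ESTIMATE: return the constant worst case without inspecting the value.
inline size_t enc_dec_varint(size_t p, uint64_t&) {
  return p + VARINT_U64_MAX;
}

// ENCODE: writes between 1 and VARINT_U64_MAX bytes.
inline char* enc_dec_varint(char* p, uint64_t& v) {
  uint64_t x = v;
  while (x >= 0x80) {
    *p++ = static_cast<char>((x & 0x7f) | 0x80);
    x >>= 7;
  }
  *p++ = static_cast<char>(x);
  return p;
}

// DECODE: reads bytes until one arrives without the continuation bit.
// Not hardened against truncated input; it only stops shifting past 64 bits.
inline const char* enc_dec_varint(const char* p, uint64_t& v) {
  uint64_t x = 0;
  int shift = 0;
  unsigned char b;
  do {
    b = static_cast<unsigned char>(*p++);
    x |= static_cast<uint64_t>(b & 0x7f) << shift;
    shift += 7;
  } while ((b & 0x80) && shift < 64);
  v = x;
  return p;
}
```

With this shape, a container of such fields can be sized as count * VARINT_U64_MAX without walking the elements, at the cost of over-allocating when the values are small.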

Though this code doesn't show it, I wrote some pseudo-code in a previous e-mail that glues this framework into the bufferlist stuff. That pseudo-code is well prepared for estimates that are too large (indeed, it expects that to happen) and it naturally handles buffer overrun detection.

I didn't describe it in the example, but this framework very naturally handles versioning; you just add some code like:

struct abc {
   int version;
   int a;
   int b;
   ..... enc_dec(p) {
      p = ::enc_dec(p, version);
      p = ::enc_dec(p, a);
      if (s != DECODE || version > 5) p = ::enc_dec(p, b); // This field is present in all estimate and encode operations, but only in decode operations when version is > 5
      return p;
   }
};
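
The versioning idea can be fleshed out into something compilable. This is only a sketch under assumptions: the abc_v5 struct standing in for an older writer, the default version numbers, and the memcpy-based int encoders are all hypothetical and chosen for illustration.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum SERIAL_TYPE { ESTIMATE, ENCODE, DECODE };

template <SERIAL_TYPE s> struct serial_type;
template <> struct serial_type<ESTIMATE> { typedef size_t type; };
template <> struct serial_type<ENCODE>   { typedef char* type; };
template <> struct serial_type<DECODE>   { typedef const char* type; };

// Raw int micro-encoders; memcpy sidesteps alignment problems.
inline size_t      enc_dec(size_t p, int&)        { return p + sizeof(int); }
inline char*       enc_dec(char* p, int& o)       { memcpy(p, &o, sizeof o); return p + sizeof o; }
inline const char* enc_dec(const char* p, int& o) { memcpy(&o, p, sizeof o); return p + sizeof o; }

// Current layout: field b only exists on the wire when version > 5.
struct abc {
  int version = 6;
  int a = 0;
  int b = 0;

  template <SERIAL_TYPE s>
  typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
    p = ::enc_dec(p, version);
    p = ::enc_dec(p, a);
    // Always estimated and encoded; decoded only if the writer was new enough.
    if (s != DECODE || version > 5) p = ::enc_dec(p, b);
    return p;
  }
};

// Hypothetical older writer that predates field b.
struct abc_v5 {
  int version = 5;
  int a = 0;

  template <SERIAL_TYPE s>
  typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
    p = ::enc_dec(p, version);
    p = ::enc_dec(p, a);
    return p;
  }
};
```

Decoding a buffer written by abc_v5 into an abc leaves b at whatever value it already held, so the reader should initialize new fields to a sensible default before decoding.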

What this framework doesn't yet handle very well is situations where you have a container with a contained type that is a primitive (e.g., uint8) and you want that contained type to be custom encoded.
Currently, the only solution is to replace the contained primitive type with a class wrapper. Unfortunately a typedef is NOT sufficient to differentiate it.
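
A sketch of the class-wrapper workaround follows. The small_int type and its one-byte encoding are hypothetical, loosely modeled on the enc_dec_lba example; the set estimator here uses a default-constructed dummy element rather than the null-reference trick from the posted code. A typedef such as "typedef int lba_t;" would still resolve to the plain int overloads, which is why a distinct struct is needed.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <set>

// Default raw encoders for primitives (memcpy avoids unaligned access).
#define DEFINE_ENC_DEC_RAW(type) \
inline size_t      enc_dec(size_t p, type&)        { return p + sizeof(type); } \
inline char*       enc_dec(char* p, type& o)       { memcpy(p, &o, sizeof(type)); return p + sizeof(type); } \
inline const char* enc_dec(const char* p, type& o) { memcpy(&o, p, sizeof(type)); return p + sizeof(type); }

DEFINE_ENC_DEC_RAW(int)
DEFINE_ENC_DEC_RAW(size_t)

// Hypothetical wrapper: a distinct type, so overload resolution can pick a
// custom one-byte encoding (values are assumed to fit in 0..255).
struct small_int {
  int v;
  bool operator<(const small_int& o) const { return v < o.v; }
};
inline size_t      enc_dec(size_t p, small_int&)        { return p + 1; }
inline char*       enc_dec(char* p, small_int& o)       { *p = static_cast<char>(o.v); return p + 1; }
inline const char* enc_dec(const char* p, small_int& o) { o.v = static_cast<unsigned char>(*p); return p + 1; }

// Generic set<t> machinery; it picks up whichever element overload matches t.
template <typename t>
inline size_t enc_dec(size_t p, std::set<t>& s) {
  t dummy{};  // fixed-size element assumed
  return p + sizeof(size_t) + s.size() * enc_dec(size_t(0), dummy);
}

template <typename t>
inline char* enc_dec(char* p, std::set<t>& s) {
  size_t n = s.size();
  p = enc_dec(p, n);
  for (const t& e : s) { t tmp = e; p = enc_dec(p, tmp); }
  return p;
}

template <typename t>
inline const char* enc_dec(const char* p, std::set<t>& s) {
  size_t n = 0;
  p = enc_dec(p, n);
  while (n--) { t tmp{}; p = enc_dec(p, tmp); s.insert(tmp); }
  return p;
}
```

The same container template then produces sizeof(int) bytes per element for set<int> and one byte per element for set<small_int>, with no change to the container code.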

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Thursday, July 14, 2016 4:16 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sweil@redhat.com>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> On 07/14/2016 12:52 AM, Allen Samuels wrote:
> > As promised, here's some code that hacks out a new encode/decode
> framework. That has the advantage of only having to list the fields of a struct
> once and is pretty much guaranteed to never overrun a buffer....
> >
> > Comments are requested :)
> 
> It compiles! :D
> 
> I looked over the code, but I want to look it over again after I've had my
> coffee since I'm still shaking the cobwebs out.  Would the idea here be that if
> you are doing varint encoding for example that you always allocate the
> buffer based on ESTIMATE (also taking into account the encoding overhead),
> but typically expect a much smaller encoding?
> 
> As it is, it's very clever.
> 
> Mark
> 
> >
> >
> > #include <iostream>
> > #include <fstream>
> > #include <set>
> > #include <string>
> > #include <string.h>
> >
> > /*******************************************************
> >
> >
> >    New fast encode/decode framework.
> >
> >    The entire framework is built around the idea that each object has three operations:
> >
> >      ESTIMATE  -- worst-case estimate of the amount of storage required for this object
> >      ENCODE    -- encode object into buffer of size ESTIMATE
> >      DECODE    -- decode object from buffer of size actual.
> >
> >    Each object has a single templated function that actually provides all three operations in a single set of code.
> >    By doing this, it's pretty much guaranteed that the ESTIMATE and the ENCODE code are in harmony (i.e. that the estimate is correct)
> >    it also saves a lot of typing/reading...
> >
> >    Generally, all three operations are provided on a single function name with the input and return parameters overloaded to distinguish them.
> >
> >    It's observed that for each of the three operations there is a single value which needs to be transmitted between each of the micro-encode/decode calls
> >    Yes, this is confusing, but let's look at a simple example
> >
> >     struct simple {
> >       int a;
> >       float b;
> >       string c;
> >       set<int> d;
> >     };
> >
> >     To encode this struct we generate a function that does the micro-encoding of each of the fields of the struct
> >     Here's an example of a function that does the ESTIMATE operation.
> >
> >     size_t simple::estimate() {
> >        return
> >           sizeof(a) +
> >           sizeof(b) +
> >           c.size() +
> >           d.size() * sizeof(int);
> >     }
> >
> >     We're going to re-write it as:
> >
> >     size_t simple::estimate(size_t p) {
> >        p = estimate(p,a);
> >        p = estimate(p,b);
> >        p = estimate(p,c);
> >        p = estimate(p,d);
> >        return p;
> >     }
> >
> >     assuming helper functions along the lines of:
> >
> >     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
> >     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
> >
> >
> >     similarly, the encode operation is represented as:
> >
> >     char * simple::encode(char *p) {
> >        p = encode(p,a);
> >        p = encode(p,b);
> >        p = encode(p,c);
> >        p = encode(p,d);
> >        return p;
> >     }
> >
> >     similarly, the decode operation is represented as:
> >
> >     const char * simple::decode(const char *p) {
> >        p = decode(p,a);
> >        p = decode(p,b);
> >        p = decode(p,c);
> >        p = decode(p,d);
> >        return p;
> >     }
> >
> >
> > You can now see that it's possible to create a single function that
> > does all three operations in a single block of code, provided that you can
> > fiddle the input/output parameter types appropriately.
> >
> > In essence the pattern is
> >
> >     p = enc_dec(p,struct_field_1);
> >     p = enc_dec(p,struct_field_2);
> >     p = enc_dec(p,struct_field_3);
> >
> > With the type of p being set differently for each operation, i.e.,
> >     for ESTIMATE, p = size_t
> >     for ENCODE,   p = char *
> >     for DECODE,   p = const char *
> >
> > This is the essence of how the encode/decode framework operates.
> > Though there is some more sophistication...
> >
> > ----------------------
> >
> > We also want to allow the encode/decode machinery to be per-type and
> > to operate
> >
> >
> > *****************************************************************************/
> >
> > using namespace std;
> >
> > //
> > // Just like the existing encode/decode machinery. The environment provides a rich set of
> > // pre-defined encodes for primitive types and containers
> > //
> >
> > #define DEFINE_ENC_DEC_RAW(type) \
> > inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> > inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
> > inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }
> >
> > DEFINE_ENC_DEC_RAW(int);
> > DEFINE_ENC_DEC_RAW(size_t);
> >
> > //
> > // String encode/decode (Yea, I know size_t isn't portable -- this is an EXAMPLE man...)
> > //
> > inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
> > inline char * enc_dec(char * p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
> > inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); }
> >
> > //
> > // Let's do a container.
> > //
> > // One of the problems with a container is that making an accurate estimate of the size
> > // would theoretically require that you walk the entire container and add up the sizes of each element.
> > // We probably don't want to do that. So here, I do a hack that just assumes that I can fake up an individual element
> > // and multiply that by the number of elements in a container. This hack works anytime that the estimate function
> > // for the contained type has a fixed maximum size. BTW, this is safe: if the contained type has a variable size
> > //  (like set<string>) then it will fault out the first time you run it.
> > //
> > // Naturally, something like set<string> or map<string,string> is a highly desirable thing to be able to encode/decode;
> > // there's no reason that you can't create an enc_dec_slow function that properly computes the maximum size by walking the container.
> > //
> > template<typename t>
> > inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
> >
> > template<typename t>
> > inline char *enc_dec(char *p,set<t>& s) {
> >    size_t sz = s.size();
> >    p = enc_dec(p,sz);
> >    for (const t& e : s) {
> >       p = enc_dec(p,const_cast<t&>(e));
> >    }
> >    return p;
> > }
> >
> > template<typename t>
> > inline const char *enc_dec(const char *p,set<t>&s) {
> >    size_t sz;
> >    p = enc_dec(p,sz);
> >    while (sz--) {
> >       t temp;
> >       p = enc_dec(p,temp);
> >       s.insert(temp);
> >    }
> >    return p;
> > }
> >
> > //
> > // Specialized encode/decode for a single data type. These are invoked explicitly...
> > //
> > inline size_t enc_dec_lba(size_t p,int& lba) {
> >    return p + sizeof(lba); // Max....
> > }
> >
> > inline char * enc_dec_lba(char *p,int& lba) {
> >    *p = 15;
> >    return p + 1; // blah blah
> > }
> >
> > inline const char *enc_dec_lba(const char *p,int& lba) {
> >    lba = *p;
> >    return p+1;
> > }
> >
> > //
> > // Specialized encode/decode for more sophisticated primitives.
> > //
> > // Here's an example of an encoder/decoder for a pair of fields
> > //
> > inline size_t enc_dec_range(size_t p,short& start,short& end) {
> >    return p + 2 * sizeof(short);
> > }
> >
> > inline char *enc_dec_range(char *p, short& start, short& end) {
> >    short *s = (short *) p;
> >    s[0] = start;
> >    s[1] = end;
> >    return p + sizeof(short) * 2;
> > }
> >
> > inline const char *enc_dec_range(const char *p,short& start, short& end) {
> >    start = *(short *)p;
> >    end   = *(short *)(p + sizeof(short));
> >    return p + 2*sizeof(short);
> > }
> >
> >
> > //
> > // Some C++ template wizardry to make the single encode/decode function possible.
> > //
> > enum SERIAL_TYPE {
> >    ESTIMATE,
> >    ENCODE,
> >    DECODE
> > };
> >
> > template <enum SERIAL_TYPE s> struct serial_type;
> >
> > template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> > template<> struct serial_type<ENCODE>   { typedef char * type; };
> > template<> struct serial_type<DECODE>   { typedef const char *type; };
> >
> > //
> > // This macro is the key, it connects the external non-member function to the correct member function.
> > //
> > #define DEFINE_STRUCT_ENC_DEC(s) \
> > inline size_t      enc_dec(size_t p, s &o) { return o.enc_dec<ESTIMATE>(p); } \
> > inline char *      enc_dec(char *p , s &o)  { return o.enc_dec<ENCODE>(p); } \
> > inline const char *enc_dec(const char *p,s &o)  { return o.enc_dec<DECODE>(p); }
> >
> > //
> > // Our example structure
> > //
> > struct astruct {
> >    int a;
> >    set<int> b;
> >    int lba;
> >    short start,end;
> >
> >    //
> >    // <<<<< You need to provide this function just once.
> >    //
> >    template<enum SERIAL_TYPE s> typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
> >       p = ::enc_dec(p,a);
> >       p = ::enc_dec(p,b);
> >       p = ::enc_dec_lba(p,lba);
> >       p = ::enc_dec_range(p,start,end);
> >       return p;
> >    }
> > };
> >
> > //
> > // This macro connects the global enc_dec to the member function.
> > // One of these per struct declaration
> > //
> > DEFINE_STRUCT_ENC_DEC(astruct);
> >
> >
> > //
> > // Here's a simple test program. The real encode/decode framework needs to be connected to bufferlist using the pseudo-code
> > // that I documented in my previous email.
> > //
> >
> > int main(int argc,char **argv) {
> >
> >    astruct a;
> >    a.a = 10;
> >    a.b.insert(2);
> >    a.b.insert(3);
> >    a.lba = 12;
> >
> >    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> >    cout << "Estimated size is " << s << "\n";
> >
> >    char buffer[100];
> >
> >    char *end = a.enc_dec<ENCODE>(buffer);
> >
> >    cout << "Actual storage was " << end-buffer << "\n";
> >
> >    astruct b;
> >
> >    (void) b.enc_dec<DECODE>(buffer); // decode it
> >
> >    cout << "A.a = " << b.a << "\n";
> >    for (auto e : b.b) {
> >       cout << " " << e;
> >    }
> >
> >    cout << "\n";
> >
> >    cout << "a.lba = " << b.lba << "\n";
> >
> >    return 0;
> > }
> >
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@SanDisk.com
> >
> >
> >> -----Original Message-----
> >> From: Mark Nelson [mailto:mnelson@redhat.com]
> >> Sent: Tuesday, July 12, 2016 8:13 PM
> >> To: Sage Weil <sweil@redhat.com>; Allen Samuels
> >> <Allen.Samuels@sandisk.com>
> >> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> >> Subject: Re: bluestore onode diet and encoding overhead
> >>
> >>
> >>
> >> On 07/12/2016 08:50 PM, Sage Weil wrote:
> >>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> >>>> Good analysis.
> >>>>
> >>>> My original comments about putting the oNode on a diet included the
> >>>> idea of a "custom" encode/decode path for certain high-usage cases.
> >>>> At the time, Sage resisted going down that path hoping that a more
> >>>> optimized generic case would get the job done. Your analysis shows
> >>>> that while we've achieved significant space reduction this has come
> >>>> at the expense of CPU time -- which dominates small object
> >>>> performance (I suspect that eventually we'd discover that the
> >>>> variable length decode path would be responsible for a substantial
> >>>> read performance degradation also -- which may or may not be part of
> >>>> the read performance drop-off that you're seeing). This isn't a
> >>>> surprising result, though it is unfortunate.
> >>>>
> >>>> I believe we need to revisit the idea of custom encode/decode paths
> >>>> for high-usage cases, only now the gains need to be focused on CPU
> >>>> utilization as well as space efficiency.
> >>>
> >>> I still think we can get most or all of the way there in a generic way
> >>> by revising the way that we interact with bufferlist for encode and
> >>> decode.
> >>> We haven't actually tried to optimize this yet, and the current code
> >>> is pretty horribly inefficient (asserts all over the place, and many
> >>> layers of pointer indirection to do a simple append).  I think we need
> >>> to do two
> >>> things:
> >>>
> >>> 1) decode path: optimize the iterator class so that it has a const
> >>> char *current and const char *current_end that point into the current
> >>> buffer::ptr.  This way any decode will have a single pointer
> >>> add+comparison to ensure there is enough data to copy before falling
> >>> into the slow path (partial buffer, move to next buffer, etc.).
> >>>
> >>
> >> I don't have a good sense yet for how much this is hurting us in the read
> >> path.  We screwed something up in the last couple of weeks and small
> >> reads are quite slow.
> >>
> >>> 2) Having that comparison is still not ideal, but we should consider
> >>> ways to get around that too.  For example, if we know that we are
> >>> going to decode N M-byte things, we could do an iterator 'reserve' or
> >>> 'check' that ensures we have a valid pointer for that much and then
> >>> proceed without checks.  The interface here would be tricky, though,
> >>> since in the slow case we'll span buffers and need to magically fall
> >>> back to a different decode path (hard to maintain) or do a temporary
> >>> copy (probably faster but we need to ensure the iterator owns it and
> >>> frees it later).  I'd say this is step 2 and optional; step 1 will have the
> >>> most benefit.
> >>>
> >>> 3) encode path: currently all encode methods take a bufferlist& and
> >>> the bufferlist itself as an append buffer.  I think this is flawed and
> >>> limiting.  Instead, we should make a new class called
> >>> buffer::list::appender (or similar) and templatize the encode methods
> >>> so they can take a safe_appender (which does bounds checking) or an
> >>> unsafe_appender (which does not).  For the latter, the user takes
> >>> responsibility for making sure there is enough space by doing a
> >>> reserve() type call which returns an unsafe_appender, and it's their
> >>> job to make sure they don't shove too much data into it.  That should
> >>> make the encode path a memcpy + ptr increment (for savvy/optimized
> >>> callers).
> >>
> >> Seems reasonable and similar in performance to what Piotr and I were
> >> discussing this morning.  As a very simple test I was thinking of doing a
> >> quick size computation and then passing that in to increase the
> >> append_buffer size when the bufferlist is created in
> >> Bluestore::_txc_write_nodes.  His idea went a bit farther to break the
> >> encapsulation, compute the fully encoded message, and dump it directly
> >> into a buffer of a computed size without the extra assert checks or
> >> bounds checking.  Obviously his idea would be faster but more work.
> >>
> >> It sounds like your solution would be similar but a bit more formalized.
> >>
> >>>
> >>> I suggest we use bluestore as a test case to make the interfaces work
> >>> and be fast.  If we succeed we can take advantage of it across the
> >>> rest of the code base as well.
> >>
> >> Do we have other places in the code with similar byte append behavior?
> >> That's what's really killing us I think, especially with how small the new
> >> append_buffer is when you run out of space when appending bytes.
> >>
> >>>
> >>> That's my thinking, at least.  I haven't had time to prototype it out
> >>> yet, but I think our goal should be to make the encode/decode paths
> >>> capable of being a memcpy + ptr addition in the fast path, and let
> >>> that guide the interface...
> >>>
> >>> sage

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-14 11:15         ` Mark Nelson
  2016-07-14 14:10           ` Allen Samuels
@ 2016-07-14 14:14           ` Allen Samuels
  2016-07-14 16:20           ` Allen Samuels
  2 siblings, 0 replies; 39+ messages in thread
From: Allen Samuels @ 2016-07-14 14:14 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil; +Cc: ceph-devel

w.r.t. allocation, yes -- exactly. Here's a snippet from my previous e-mail that completes the picture:

Fundamentally, the problem that we have is lack of knowledge about how much data is going to be encoded (and to a lesser extent, decoded). Currently, we basically check for space on each micro-encode operation. Clearly those checks have to be eliminated. That leaves us with basically three choices:

(1) Pre-allocate a buffer sufficiently large.
(2) Check for sufficient space at "strategic" places in the code.
(3) Fiddle with the CPU memory mapping tables to generate a page fault when we run off the end of our buffer (which clearly must be page aligned, blah blah blah) and then contrive to auto-allocate and restart.

I reject (3) as too complex and not worth the effort.

I reject (2) as difficult to maintain and difficult to determine where the "strategic" points in the code are that retain correctness, but happen infrequently enough to minimize CPU utilization. 

That leaves us with (1).

Method (1) consists of making a worst-case prediction of the size of the encoded data, encoding the data into the allocated buffer, and then "freeing" the unused portion (since tight encoding of the data is content-dependent in length). The encode and free processes are relatively straightforward, but the prediction process has some options to explore.

(1a) Constant "maximum". We could easily establish something like a 128K or 256K constant upper limit and just use a buffer of that size.
(1b) Computed "maximum". For simple objects (a few fields/containers) it's relatively easy to add up the necessary sizes to generate an estimate. For complex objects, this is error-prone because you're duplicating the encode logic in the estimate-size logic.
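
To make (1b) concrete, here is a minimal sketch of a computed worst case for a variable-length encoding. The names put_varint, estimate and encode are made up for illustration, and the LEB128-style varint is an assumption rather than the encoding actually used in the tree. The only point is that the estimate is a cheap upper bound while the encoded size depends on the content.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Worst case for a LEB128-style varint of a 64-bit value: ceil(64 / 7) = 10 bytes.
constexpr std::size_t kMaxVarint64 = 10;

// Hypothetical helper: writes v at p, 7 bits per byte, low bits first.
// Returns a pointer to the next unused byte.
inline char *put_varint(char *p, std::uint64_t v) {
  while (v >= 0x80) {
    *p++ = static_cast<char>((v & 0x7f) | 0x80);  // high bit set: more bytes follow
    v >>= 7;
  }
  *p++ = static_cast<char>(v);
  return p;
}

// Computed "maximum": one varint for the element count plus one per element.
inline std::size_t estimate(const std::vector<std::uint64_t> &v) {
  return kMaxVarint64 * (1 + v.size());
}

// Encodes into buf, which the caller sized with estimate(); returns bytes used.
inline std::size_t encode(const std::vector<std::uint64_t> &v, char *buf) {
  char *p = put_varint(buf, v.size());
  for (std::uint64_t e : v) p = put_varint(p, e);
  return static_cast<std::size_t>(p - buf);
}
```

For small values the bytes used are far below the estimate, which is why the unused tail has to be "freed" after encoding.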

Right now, I'm trying to concoct some combination of C++ templates that lets me merge the estimate-size, encode and decode functions into a common routine so that we can avoid this systematic error. Stay tuned.

One of the dangers of the prediction scheme is a prediction that is incorrect -- too small. Then you'll get a buffer overrun. It's been suggested that we insert some special debug-assert code to detect that situation, enabled only by a compile-time option. I believe this is NOT the right solution. The buffer overrun problem is data-dependent. That means it will be especially hard to debug in the field, as what you're looking for is essentially silent data corruption.

I believe that the best solution is to simply check the fully-encoded buffer against the estimate that was made. If the estimate is too small, then assert-out. Leave this in production code. If we encounter an encode-buffer overrun, at least we'll know that was the source of the problem and we can fix it. (I assert that if you see this, it'll be pretty obvious where it went wrong -- especially if I'm able to create a unified estimate-size, encode and decode function.)

One last problem to solve is the decode problem. We need to know how much data is in a fast-encode buffer in order to ensure that it's not fragmented in the buffer::list. This is relatively easy if the encode leaves a "total bytes" field at the start of the operation.

Now we can see the pseudo-code for fast-encode.

void fast_encode(object& o, bufferptr& b) {
   size_t estimate = o.estimate_sizeof_fast_encode() + sizeof(int);  // Extra int is for the explicit size of the overall buffer.
   char *buf_start = b.push_back(estimate);  // Returns pointer to a block of memory of the specified size, appended to the end.
   char *buf_end   = o.do_fast_encode(buf_start + sizeof(int));  // Serializes starting at the given address; returns the "next" pointer, i.e., pointer to the next unused byte.
   size_t consumed = buf_end - buf_start;  // Compute consumed bytes, including the leading size int.

   assert(consumed <= estimate);  // Here's where we catch a silent data corruption due to an overflow.

   *(int *)buf_start = consumed;  // Total size of the consumed buffer, including the starting int, stored back at the start.

   size_t unused = estimate - consumed;  // Amount of unused space at the end.

   b.pop_back(unused);  // "Free" the unused bytes at the end of the buffer.

}

Now we can see the shape of the fast_decode function.

void fast_decode(object& o, bufferptr& b) {

   int encode_size = b.pop_front_int();  // Remove the first bytes, where the encode left the size of the full buffer (including that int).

   size_t payload_size = encode_size - sizeof(int);  // Bytes that follow the size prefix.

   const char *buf_start = b.contiguous_ptr(payload_size);  // Return pointer to a sequential buffer of the specified number of bytes. <<<<- Here's where we might have to copy discontiguous buffers into a single buffer.

   const char *buf_end = o.fast_decode(buf_start);  // Decode data from the buffer into the object; return the pointer to the next unconsumed byte.

   size_t consumed_bytes = buf_end - buf_start;

   assert(consumed_bytes == payload_size);  // Consistency check: we consumed exactly as much as we encoded, less the sizeof(int) prefix already removed above.

   b.pop_front(consumed_bytes);  // Logically remove the bytes that we've consumed.

}
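
The pseudo-code above can be turned into a small runnable sketch. It assumes a std::vector<char> as a stand-in for the bufferlist, a fixed-width 32-bit length prefix, and a made-up struct sample with a deliberately generous estimate; none of these are the real Ceph types. The overrun check calls std::abort instead of assert so that it survives NDEBUG builds, in keeping with leaving the check in production code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

using prefix_t = std::uint32_t;  // assumed width of the "total bytes" field

// Always-on check (hypothetical helper); unlike assert it is not compiled out.
inline void check(bool ok) { if (!ok) std::abort(); }

// Made-up object providing the three operations from the pseudo-code.
struct sample {
  std::uint32_t a = 0;
  std::uint16_t b = 0;

  // Worst case, padded by 8 bytes to mimic a pessimistic estimate.
  std::size_t estimate_sizeof_fast_encode() const { return sizeof(a) + sizeof(b) + 8; }

  char *do_fast_encode(char *p) const {
    std::memcpy(p, &a, sizeof(a)); p += sizeof(a);
    std::memcpy(p, &b, sizeof(b)); p += sizeof(b);
    return p;  // next unused byte
  }
  const char *do_fast_decode(const char *p) {
    std::memcpy(&a, p, sizeof(a)); p += sizeof(a);
    std::memcpy(&b, p, sizeof(b)); p += sizeof(b);
    return p;  // next unconsumed byte
  }
};

inline void fast_encode(const sample &o, std::vector<char> &out) {
  const std::size_t estimate = o.estimate_sizeof_fast_encode() + sizeof(prefix_t);
  const std::size_t base = out.size();
  out.resize(base + estimate);                      // plays the role of push_back(estimate)
  char *buf_start = out.data() + base;
  char *buf_end = o.do_fast_encode(buf_start + sizeof(prefix_t));
  const std::size_t consumed = static_cast<std::size_t>(buf_end - buf_start);
  check(consumed <= estimate);                      // catches an under-estimate
  const prefix_t total = static_cast<prefix_t>(consumed);
  std::memcpy(buf_start, &total, sizeof(total));    // size prefix counts itself
  out.resize(base + consumed);                      // plays the role of pop_back(unused)
}

// Decodes one object from a contiguous span; returns the bytes consumed,
// including the prefix.
inline std::size_t fast_decode(sample &o, const char *in, std::size_t avail) {
  check(avail >= sizeof(prefix_t));
  prefix_t total = 0;
  std::memcpy(&total, in, sizeof(total));
  check(total >= sizeof(prefix_t) && total <= avail);
  const char *buf_start = in + sizeof(prefix_t);
  const char *buf_end = o.do_fast_decode(buf_start);
  check(static_cast<std::size_t>(buf_end - buf_start) == total - sizeof(prefix_t));
  return total;
}
```

As in the pseudo-code, the encode-side check fires only after the overrun has already happened, so it identifies the culprit rather than preventing the damage.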



Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Thursday, July 14, 2016 4:16 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sweil@redhat.com>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> On 07/14/2016 12:52 AM, Allen Samuels wrote:
> > As promised, here's some code that hacks out a new encode/decode framework. That has the advantage of only having to list the fields of a struct
> > once and is pretty much guaranteed to never overrun a buffer....
> >
> > Comments are requested :)
> 
> It compiles! :D
> 
> I looked over the code, but I want to look it over again after I've had my
> coffee since I'm still shaking the cobwebs out.  Would the idea here be that if
> you are doing varint encoding for example that you always allocate the
> buffer based on ESTIMATE (also taking into account the encoding overhead),
> but typically expect a much smaller encoding?
> 
> As it is, it's very clever.
> 
> Mark
> 
> >
> >
> > #include <iostream>
> > #include <fstream>
> > #include <set>
> > #include <string>
> > #include <string.h>
> >
> > /*******************************************************
> >
> >
> >    New fast encode/decode framework.
> >
> >    The entire framework is built around the idea that each object has three operations:
> >
> >      ESTIMATE  -- worst-case estimate of the amount of storage required for this object
> >      ENCODE    -- encode object into buffer of size ESTIMATE
> >      DECODE    -- decode object from buffer of size actual.
> >
> >    Each object has a single templated function that actually provides all three operations in a single set of code.
> >    By doing this, it's pretty much guaranteed that the ESTIMATE and the ENCODE code are in harmony (i.e. that the estimate is correct);
> >    it also saves a lot of typing/reading...
> >
> >    Generally, all three operations are provided on a single function name with the input and return parameters overloaded to distinguish them.
> >
> >    It's observed that for each of the three operations there is a single value which needs to be transmitted between each of the micro-encode/decode calls.
> >    Yes, this is confusing, but let's look at a simple example
> >
> >     struct simple {
> >       int a;
> >       float b;
> >       string c;
> >       set<int> d;
> >     };
> >
> >     To encode this struct we generate a function that does the micro-encoding of each of the fields of the struct.
> >     Here's an example of a function that does the ESTIMATE operation.
> >
> >     size_t simple::estimate() {
> >        return
> >           sizeof(a) +
> >           sizeof(b) +
> >           c.size() +
> >           d.size() * sizeof(int);
> >     }
> >
> >     We're going to re-write it as:
> >
> >     size_t simple::estimate(size_t p) {
> >        p = estimate(p,a);
> >        p = estimate(p,b);
> >        p = estimate(p,c);
> >        p = estimate(p,d);
> >        return p;
> >     }
> >
> >     assuming something like these functions:
> >
> >     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
> >     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
> >
> >
> >     similarly, the encode operation is represented as:
> >
> >     char * simple::encode(char *p) {
> >        p = encode(p,a);
> >        p = encode(p,b);
> >        p = encode(p,c);
> >        p = encode(p,d);
> >        return p;
> >     }
> >
> >     similarly, the decode operation is represented as:
> >
> >     const char * simple::decode(const char *p) {
> >        p = decode(p,a);
> >        p = decode(p,b);
> >        p = decode(p,c);
> >        p = decode(p,d);
> >        return p;
> >     }
> >
> >
> > You can now see that it's possible to create a single function that does all three operations
> > in a single block of code, provided that you can fiddle the input/output parameter types appropriately.
> >
> > In essence the pattern is
> >
> >     p = enc_dec(p,struct_field_1);
> >     p = enc_dec(p,struct_field_2);
> >     p = enc_dec(p,struct_field_3);
> >
> > With the type of p being set differently for each operation, i.e.,
> >     for ESTIMATE, p = size_t
> >     for ENCODE,   p = char *
> >     for DECODE,   p = const char *
> >
> > This is the essence of how the encode/decode framework operates, though there is some more sophistication...
> >
> > ----------------------
> >
> > We also want to allow the encode/decode machinery to be per-type and
> > to operate
> >
> >
> > *****************************************************************************/
> >
> > using namespace std;
> >
> > //
> > // Just like the existing encode/decode machinery. The environment provides a rich set of
> > // pre-defined encodes for primitive types and containers
> > //
> >
> > #define DEFINE_ENC_DEC_RAW(type) \
> > inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> > inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
> > inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }
> >
> > DEFINE_ENC_DEC_RAW(int);
> > DEFINE_ENC_DEC_RAW(size_t);
> >
> > //
> > // String encode/decode (Yea, I know size_t isn't portable -- this is an EXAMPLE man...)
> > //
> > inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
> > inline char * enc_dec(char *p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
> > inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); }
> >
> > //
> > // Let's do a container.
> > //
> > // One of the problems with a container is that making an accurate estimate of the size
> > // would theoretically require that you walk the entire container and add up the sizes of each element.
> > // We probably don't want to do that. So here, I do a hack that just assumes that I can fake up an individual element
> > // and multiply that by the number of elements in a container. This hack works anytime that the estimate function
> > // for the contained type has a fixed maximum size. BTW, this is safe: if the contained type has a variable size
> > // (like set<string>) then it will fault out the first time you run it.
> > //
> > // Naturally, something like set<string> or map<string,string> is a highly desirable thing to be able to encode/decode;
> > // there's no reason that you can't create an enc_dec_slow function that properly computes the maximum size by walking the container.
> > //
> > template<typename t>
> > inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
> >
> > template<typename t>
> > inline char *enc_dec(char *p,set<t>& s) {
> >    size_t sz = s.size();
> >    p = enc_dec(p,sz);
> >    for (const t& e : s) {
> >       p = enc_dec(p,const_cast<t&>(e));
> >    }
> >    return p;
> > }
> >
> > template<typename t>
> > inline const char *enc_dec(const char *p,set<t>&s) {
> >    size_t sz;
> >    p = enc_dec(p,sz);
> >    while (sz--) {
> >       t temp;
> >       p = enc_dec(p,temp);
> >       s.insert(temp);
> >    }
> >    return p;
> > }
> >
> > //
> > // Specialized encode/decode for a single data type. These are invoked explicitly...
> > //
> > inline size_t enc_dec_lba(size_t p,int& lba) {
> >    return p + sizeof(lba); // Max....
> > }
> >
> > inline char * enc_dec_lba(char *p,int& lba) {
> >    *p = 15;
> >    return p + 1; // blah blah
> > }
> >
> > inline const char *enc_dec_lba(const char *p,int& lba) {
> >    lba = *p;
> >    return p+1;
> > }
> >
> > //
> > // Specialized encode/decode for more sophisticated primitives.
> > //
> > // Here's an example of an encoder/decoder for a pair of fields
> > //
> > inline size_t enc_dec_range(size_t p,short& start,short& end) {
> >    return p + 2 * sizeof(short);
> > }
> >
> > inline char *enc_dec_range(char *p, short& start, short& end) {
> >    short *s = (short *) p;
> >    s[0] = start;
> >    s[1] = end;
> >    return p + sizeof(short) * 2;
> > }
> >
> > inline const char *enc_dec_range(const char *p,short& start, short& end) {
> >    start = *(short *)p;
> >    end   = *(short *)(p + sizeof(short));
> >    return p + 2*sizeof(short);
> > }
> >
> >
> > //
> > // Some C++ template wizardry to make the single encode/decode function possible.
> > //
> > enum SERIAL_TYPE {
> >    ESTIMATE,
> >    ENCODE,
> >    DECODE
> > };
> >
> > template <enum SERIAL_TYPE s> struct serial_type;
> >
> > template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> > template<> struct serial_type<ENCODE>   { typedef char * type; };
> > template<> struct serial_type<DECODE>   { typedef const char *type; };
> >
> > //
> > // This macro is the key, it connects the external non-member function to the correct member function.
> > //
> > #define DEFINE_STRUCT_ENC_DEC(s) \
> > inline size_t      enc_dec(size_t p, s &o)      { return o.enc_dec<ESTIMATE>(p); } \
> > inline char *      enc_dec(char *p, s &o)       { return o.enc_dec<ENCODE>(p); } \
> > inline const char *enc_dec(const char *p, s &o) { return o.enc_dec<DECODE>(p); }
> >
> > //
> > // Our example structure
> > //
> > struct astruct {
> >    int a;
> >    set<int> b;
> >    int lba;
> >    short start,end;
> >
> >    //
> >    // <<<<< You need to provide this function just once.
> >    //
> >    template<enum SERIAL_TYPE s> typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
> >       p = ::enc_dec(p,a);
> >       p = ::enc_dec(p,b);
> >       p = ::enc_dec_lba(p,lba);
> >       p = ::enc_dec_range(p,start,end);
> >       return p;
> >    }
> > };
> >
> > //
> > // This macro connects the global enc_dec to the member function.
> > // One of these per struct declaration
> > //
> > DEFINE_STRUCT_ENC_DEC(astruct);
> >
> >
> > //
> > // Here's a simple test program. The real encode/decode framework needs to be connected to bufferlist
> > // using the pseudo-code that I documented in my previous email.
> > //
> >
> > int main(int argc,char **argv) {
> >
> >    astruct a;
> >    a.a = 10;
> >    a.b.insert(2);
> >    a.b.insert(3);
> >    a.lba = 12;
> >
> >    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> >    cout << "Estimated size is " << s << "\n";
> >
> >    char buffer[100];
> >
> >    char *end = a.enc_dec<ENCODE>(buffer);
> >
> >    cout << "Actual storage was " << end-buffer << "\n";
> >
> >    astruct b;
> >
> >    (void) b.enc_dec<DECODE>(buffer); // decode it
> >
> >    cout << "A.a = " << b.a << "\n";
> >    for (auto e : b.b) {
> >       cout << " " << e;
> >    }
> >
> >    cout << "\n";
> >
> >    cout << "a.lba = " << b.lba << "\n";
> >
> >    return 0;
> > }
> >
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> >> -----Original Message-----
> >> From: Mark Nelson [mailto:mnelson@redhat.com]
> >> Sent: Tuesday, July 12, 2016 8:13 PM
> >> To: Sage Weil <sweil@redhat.com>; Allen Samuels
> >> <Allen.Samuels@sandisk.com>
> >> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> >> Subject: Re: bluestore onode diet and encoding overhead
> >>
> >>
> >>
> >> On 07/12/2016 08:50 PM, Sage Weil wrote:
> >>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> >>>> Good analysis.
> >>>>
> >>>> My original comments about putting the oNode on a diet included the
> >>>> idea of a "custom" encode/decode path for certain high-usage cases.
> >>>> At the time, Sage resisted going down that path hoping that a more
> >>>> optimized generic case would get the job done. Your analysis shows
> >>>> that while we've achieved significant space reduction this has come
> >>>> at the expense of CPU time -- which dominates small object
> >>>> performance (I suspect that eventually we'd discover that the
> >>>> variable length decode path would be responsible for a substantial
> >>>> read performance degradation also -- which may or may not be part of
> >>>> the read performance drop-off that you're seeing). This isn't a
> >>>> surprising result, though it is unfortunate.
> >>>>
> >>>> I believe we need to revisit the idea of custom encode/decode paths
> >>>> for high-usage cases, only now the gains need to be focused on CPU
> >>>> utilization as well as space efficiency.
> >>>
> >>> I still think we can get most or all of the way there in a generic way
> >>> by revising the way that we interact with bufferlist for encode and
> >>> decode.  We haven't actually tried to optimize this yet, and the current
> >>> code is pretty horribly inefficient (asserts all over the place, and many
> >>> layers of pointer indirection to do a simple append).  I think we need
> >>> to do two things:
> >>>
> >>> 1) decode path: optimize the iterator class so that it has a const
> >>> char *current and const char *current_end that point into the current
> >>> buffer::ptr.  This way any decode will have a single pointer
> >>> add+comparison to ensure there is enough data to copy before falling
> >>> into the slow path (partial buffer, move to next buffer, etc.).
> >>>
> >>
> >> I don't have a good sense yet for how much this is hurting us in the read
> >> path.  We screwed something up in the last couple of weeks and small
> >> reads are quite slow.
> >>
> >>> 2) Having that comparison is still not ideal, but we should consider
> >>> ways to get around that too.  For example, if we know that we are
> >>> going to decode N M-byte things, we could do an iterator 'reserve' or
> >>> 'check' that ensures we have a valid pointer for that much and then
> >>> proceed without checks.  The interface here would be tricky, though,
> >>> since in the slow case we'll span buffers and need to magically fall
> >>> back to a different decode path (hard to maintain) or do a temporary
> >>> copy (probably faster but we need to ensure the iterator owns it and
> >>> frees it later).  I'd say this is step 2 and optional; step 1 will have
> >>> the most benefit.
> >>>
> >>> 3) encode path: currently all encode methods take a bufferlist& and
> >>> the bufferlist itself as an append buffer.  I think this is flawed and
> >>> limiting.  Instead, we should make a new class called
> >>> buffer::list::appender (or similar) and templatize the encode methods
> >>> so they can take a safe_appender (which does bounds checking) or an
> >>> unsafe_appender (which does not).  For the latter, the user takes
> >>> responsibility for making sure there is enough space by doing a
> >>> reserve() type call which returns an unsafe_appender, and it's their
> >>> job to make sure they don't shove too much data into it.  That should
> >>> make the encode path a memcpy + ptr increment (for savvy/optimized
> >>> callers).
> >>
> >> Seems reasonable and similar in performance to what Piotr and I were
> >> discussing this morning.  As a very simple test I was thinking of doing a
> >> quick size computation and then passing that in to increase the
> >> append_buffer size when the bufferlist is created in
> >> Bluestore::_txc_write_nodes.  His idea went a bit farther to break the
> >> encapsulation, compute the fully encoded message, and dump it directly
> >> into a buffer of a computed size without the extra assert checks or bounds
> >> checking.  Obviously his idea would be faster but more work.
> >>
> >> It sounds like your solution would be similar but a bit more formalized.
> >>
> >>>
> >>> I suggest we use bluestore as a test case to make the interfaces work
> >>> and be fast.  If we succeed we can take advantage of it across the
> >>> rest of the code base as well.
> >>
> >> Do we have other places in the code with similar byte append behavior?
> >> That's what's really killing us I think, especially with how small the new
> >> append_buffer is when you run out of space when appending bytes.
> >>
> >>>
> >>> That's my thinking, at least.  I haven't had time to prototype it out
> >>> yet, but I think our goal should be to make the encode/decode paths
> >>> capable of being a memcpy + ptr addition in the fast path, and let
> >>> that guide the interface...
> >>>
> >>> sage
> >>>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-14 11:15         ` Mark Nelson
  2016-07-14 14:10           ` Allen Samuels
  2016-07-14 14:14           ` Allen Samuels
@ 2016-07-14 16:20           ` Allen Samuels
  2016-07-14 16:31             ` Mark Nelson
  2 siblings, 1 reply; 39+ messages in thread
From: Allen Samuels @ 2016-07-14 16:20 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil; +Cc: ceph-devel

BTW, I see this stuff as gradually replacing the existing encode/decode infrastructure. It's pretty easy to have them side-by-side as well as have the new infrastructure be wire-compatible with the current stuff. That'll allow a slow conversion from the old-style to the new-style. The only thing that's really different between the two (on the wire) is that I proposed the new stuff to have a length prefix so that the decoder knows how much data to "straighten" before launching the fast decode (this is the equivalent of the ESTIMATE phase during encode). For old-style stuff that doesn't have the prefix, you'll have to "straighten" the entire remainder of the buffer -- this may limit the rate of conversion, in that you can only afford to convert code to the new style when you know that the overhead of straightening is affordable -- probably because you know that there's not much data present OR you assume that you're in a temporary transient environment.
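
A rough sketch of the "straighten" step, to show why the length prefix matters. It assumes a buffer list modeled as a vector of byte fragments and a 32-bit length prefix that counts itself; both are assumptions for illustration and not the actual buffer::list API. The names gather, straighten_prefixed and straighten_all are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

using fragment = std::vector<char>;

// Copy exactly n bytes, starting at logical offset off, out of the fragment
// list into dst. Throws if the list holds fewer bytes than requested.
inline void gather(const std::vector<fragment> &frags, std::size_t off,
                   char *dst, std::size_t n) {
  for (const fragment &f : frags) {
    if (n == 0) break;
    if (off >= f.size()) { off -= f.size(); continue; }  // skip whole fragment
    const std::size_t take = std::min(n, f.size() - off);
    std::memcpy(dst, f.data() + off, take);
    dst += take;
    n -= take;
    off = 0;
  }
  if (n != 0) throw std::out_of_range("buffer shorter than requested span");
}

// New style: the prefix says how many bytes belong to this object, so only
// that many are copied into a contiguous buffer.
inline std::vector<char> straighten_prefixed(const std::vector<fragment> &frags) {
  std::uint32_t total = 0;
  gather(frags, 0, reinterpret_cast<char *>(&total), sizeof(total));
  if (total < sizeof(total)) throw std::invalid_argument("bad length prefix");
  std::vector<char> flat(total - sizeof(total));
  gather(frags, sizeof(total), flat.data(), flat.size());
  return flat;
}

// Old style: no prefix, so everything that remains has to be copied.
inline std::vector<char> straighten_all(const std::vector<fragment> &frags) {
  std::vector<char> flat;
  for (const fragment &f : frags) flat.insert(flat.end(), f.begin(), f.end());
  return flat;
}
```

With the prefix, the copy cost is bounded by the size of one object; without it, the cost grows with whatever else happens to follow in the buffer.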


Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Thursday, July 14, 2016 4:16 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sweil@redhat.com>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> On 07/14/2016 12:52 AM, Allen Samuels wrote:
> > As promised, here's some code that hacks out a new encode/decode framework. That has the advantage of only having to list the fields of a struct
> > once and is pretty much guaranteed to never overrun a buffer....
> >
> > Comments are requested :)
> 
> It compiles! :D
> 
> I looked over the code, but I want to look it over again after I've had my
> coffee since I'm still shaking the cobwebs out.  Would the idea here be that if
> you are doing varint encoding for example that you always allocate the
> buffer based on ESTIMATE (also taking into account the encoding overhead),
> but typically expect a much smaller encoding?
> 
> As it is, it's very clever.
> 
> Mark
> 
> >
> >
> > #include <iostream>
> > #include <fstream>
> > #include <set>
> > #include <string>
> > #include <string.h>
> >
> > /*******************************************************
> >
> >
> >    New fast encode/decode framework.
> >
> >    The entire framework is built around the idea that each object has three operations:
> >
> >      ESTIMATE  -- worst-case estimate of the amount of storage required for this object
> >      ENCODE    -- encode object into buffer of size ESTIMATE
> >      DECODE    -- decode object from buffer of size actual.
> >
> >    Each object has a single templated function that actually provides all three operations in a single set of code.
> >    By doing this, it's pretty much guaranteed that the ESTIMATE and the ENCODE code are in harmony (i.e. that the estimate is correct);
> >    it also saves a lot of typing/reading...
> >
> >    Generally, all three operations are provided on a single function name with the input and return parameters overloaded to distinguish them.
> >
> >    It's observed that for each of the three operations there is a single value which needs to be transmitted between each of the micro-encode/decode calls.
> >    Yes, this is confusing, but let's look at a simple example
> >
> >     struct simple {
> >       int a;
> >       float b;
> >       string c;
> >       set<int> d;
> >     };
> >
> >     To encode this struct we generate a function that does the micro-encoding of each of the fields of the struct.
> >     Here's an example of a function that does the ESTIMATE operation.
> >
> >     size_t simple::estimate() {
> >        return
> >           sizeof(a) +
> >           sizeof(b) +
> >           c.size() +
> >           d.size() * sizeof(int);
> >     }
> >
> >     We're going to re-write it as:
> >
> >     size_t simple::estimate(size_t p) {
> >        p = estimate(p,a);
> >        p = estimate(p,b);
> >        p = estimate(p,c);
> >        p = estimate(p,d);
> >        return p;
> >     }
> >
> >     assuming something like these functions:
> >
> >     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
> >     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
> >
> >
> >     similarly, the encode operation is represented as:
> >
> >     char * simple::encode(char *p) {
> >        p = encode(p,a);
> >        p = encode(p,b);
> >        p = encode(p,c);
> >        p = encode(p,d);
> >        return p;
> >     }
> >
> >     similarly, the decode operation is represented as:
> >
> >     const char * simple::decode(const char *p) {
> >        p = decode(p,a);
> >        p = decode(p,b);
> >        p = decode(p,c);
> >        p = decode(p,d);
> >        return p;
> >     }
> >
> >
> > You can now see that it's possible to create a single function that does all three operations
> > in a single block of code, provided that you can fiddle the input/output parameter types appropriately.
> >
> > In essence the pattern is
> >
> >     p = enc_dec(p,struct_field_1);
> >     p = enc_dec(p,struct_field_2);
> >     p = enc_dec(p,struct_field_3);
> >
> > With the type of p being set differently for each operation, i.e.,
> >     for ESTIMATE, p = size_t
> >     for ENCODE,   p = char *
> >     for DECODE,   p = const char *
> >
> > This is the essence of how the encode/decode framework operates, though there is some more sophistication...
> >
> > ----------------------
> >
> > We also want to allow the encode/decode machinery to be per-type and
> > to operate
> >
> >
> > *****************************************************************************/
> >
> > using namespace std;
> >
> > //
> > // Just like the existing encode/decode machinery. The environment provides a rich set of
> > // pre-defined encodes for primitive types and containers
> > //
> >
> > #define DEFINE_ENC_DEC_RAW(type) \
> > inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> > inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
> > inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }
> >
> > DEFINE_ENC_DEC_RAW(int);
> > DEFINE_ENC_DEC_RAW(size_t);
> >
> > //
> > // String encode/decode (Yea, I know size_t isn't portable -- this is an EXAMPLE man...)
> > //
> > inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
> > inline char * enc_dec(char *p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
> > inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); }
> >
> > //
> > // Let's do a container.
> > //
> > // One of the problems with a container is that making an accurate estimate of the size
> > // would theoretically require that you walk the entire container and add up the sizes of each element.
> > // We probably don't want to do that. So here, I do a hack that just assumes that I can fake up an individual element
> > // and multiply that by the number of elements in a container. This hack works anytime that the estimate function
> > // for the contained type has a fixed maximum size. BTW, this is safe: if the contained type has a variable size
> > // (like set<string>) then it will fault out the first time you run it.
> > //
> > // Naturally, something like set<string> or map<string,string> is a highly desirable thing to be able to encode/decode;
> > // there's no reason that you can't create an enc_dec_slow function that properly computes the maximum size by walking the container.
> > //
> > template<typename t>
> > inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
> >
> > template<typename t>
> > inline char *enc_dec(char *p,set<t>& s) {
> >    size_t sz = s.size();
> >    p = enc_dec(p,sz);
> >    for (const t& e : s) {
> >       p = enc_dec(p,const_cast<t&>(e));
> >    }
> >    return p;
> > }
> >
> > template<typename t>
> > inline const char *enc_dec(const char *p,set<t>&s) {
> >    size_t sz;
> >    p = enc_dec(p,sz);
> >    while (sz--) {
> >       t temp;
> >       p = enc_dec(p,temp);
> >       s.insert(temp);
> >    }
> >    return p;
> > }
> >
> > //
> > // Specialized encode/decode for a single data type. These are invoked explicitly...
> > //
> > inline size_t enc_dec_lba(size_t p,int& lba) {
> >    return p + sizeof(lba); // Max....
> > }
> >
> > inline char * enc_dec_lba(char *p,int& lba) {
> >    *p = 15;
> >    return p + 1; // blah blah
> > }
> >
> > inline const char *enc_dec_lba(const char *p,int& lba) {
> >    lba = *p;
> >    return p+1;
> > }
> >
> > //
> > // Specialized encode/decode for more sophisticated primitives.
> > //
> > // Here's an example of an encoder/decoder for a pair of fields
> > //
> > inline size_t enc_dec_range(size_t p,short& start,short& end) {
> >    return p + 2 * sizeof(short);
> > }
> >
> > inline char *enc_dec_range(char *p, short& start, short& end) {
> >    short *s = (short *) p;
> >    s[0] = start;
> >    s[1] = end;
> >    return p + sizeof(short) * 2;
> > }
> >
> > inline const char *enc_dec_range(const char *p,short& start, short& end) {
> >    start = *(short *)p;
> >    end   = *(short *)(p + sizeof(short));
> >    return p + 2*sizeof(short);
> > }
> >
> >
> > //
> > // Some C++ template wizardry to make the single encode/decode function possible.
> > //
> > enum SERIAL_TYPE {
> >    ESTIMATE,
> >    ENCODE,
> >    DECODE
> > };
> >
> > template <enum SERIAL_TYPE s> struct serial_type;
> >
> > template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> > template<> struct serial_type<ENCODE>   { typedef char * type; };
> > template<> struct serial_type<DECODE>   { typedef const char *type; };
> >
> > //
> > // This macro is the key, it connects the external non-member function to the correct member function.
> > //
> > #define DEFINE_STRUCT_ENC_DEC(s) \
> > inline size_t      enc_dec(size_t p, s &o)      { return o.enc_dec<ESTIMATE>(p); } \
> > inline char *      enc_dec(char *p, s &o)       { return o.enc_dec<ENCODE>(p); } \
> > inline const char *enc_dec(const char *p, s &o) { return o.enc_dec<DECODE>(p); }
> >
> > //
> > // Our example structure
> > //
> > struct astruct {
> >    int a;
> >    set<int> b;
> >    int lba;
> >    short start,end;
> >
> >    //
> >    // <<<<< You need to provide this function just once.
> >    //
> >    template<enum SERIAL_TYPE s> typename serial_type<s>::type
> >    enc_dec(typename serial_type<s>::type p) {
> >       p = ::enc_dec(p,a);
> >       p = ::enc_dec(p,b);
> >       p = ::enc_dec_lba(p,lba);
> >       p = ::enc_dec_range(p,start,end);
> >       return p;
> >    }
> > };
> >
> > //
> > // This macro connects the global enc_dec to the member function.
> > // One of these per struct declaration
> > //
> > DEFINE_STRUCT_ENC_DEC(astruct);
> >
> >
> > //
> > // Here's a simple test program. The real encode/decode framework
> > // needs to be connected to bufferlist using the pseudo-code
> > // that I documented in my previous email.
> > //
> >
> > int main(int argc,char **argv) {
> >
> >    astruct a;
> >    a.a = 10;
> >    a.b.insert(2);
> >    a.b.insert(3);
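> >    a.lba = 12;
> >    a.start = 0; // initialize the range so ENCODE doesn't read indeterminate values
> >    a.end = 0;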
> >
> >    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> >    cout << "Estimated size is " << s << "\n";
> >
> >    char buffer[100];
> >
> >    char *end = a.enc_dec<ENCODE>(buffer);
> >
> >    cout << "Actual storage was " << end-buffer << "\n";
> >
> >    astruct b;
> >
> >    (void) b.enc_dec<DECODE>(buffer); // decode it
> >
> >    cout << "A.a = " << b.a << "\n";
> >    for (auto e : b.b) {
> >       cout << " " << e;
> >    }
> >
> >    cout << "\n";
> >
> >    cout << "a.lba = " << b.lba << "\n";
> >
> >    return 0;
> > }
> >
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> >> -----Original Message-----
> >> From: Mark Nelson [mailto:mnelson@redhat.com]
> >> Sent: Tuesday, July 12, 2016 8:13 PM
> >> To: Sage Weil <sweil@redhat.com>; Allen Samuels
> >> <Allen.Samuels@sandisk.com>
> >> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> >> Subject: Re: bluestore onode diet and encoding overhead
> >>
> >>
> >>
> >> On 07/12/2016 08:50 PM, Sage Weil wrote:
> >>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> >>>> Good analysis.
> >>>>
> >>>> My original comments about putting the oNode on a diet included the
> >>>> idea of a "custom" encode/decode path for certain high-usage cases.
> >>>> At the time, Sage resisted going down that path hoping that a more
> >>>> optimized generic case would get the job done. Your analysis shows
> >>>> that while we've achieved significant space reduction this has come
> >>>> at the expense of CPU time -- which dominates small object
> >>>> performance (I suspect that eventually we'd discover that the
> >>>> variable length decode path would be responsible for a substantial
> >>>> read performance degradation also -- which may or may not be part of
> >>>> the read performance drop-off that you're seeing). This isn't a
> >>>> surprising result, though it is unfortunate.
> >>>>
> >>>> I believe we need to revisit the idea of custom encode/decode paths
> >>>> for high-usage cases, only now the gains need to be focused on CPU
> >>>> utilization as well as space efficiency.
> >>>
> >>> I still think we can get most or all of the way there in a generic way
> >>> by revising the way that we interact with bufferlist for encode and
> >>> decode.  We haven't actually tried to optimize this yet, and the current
> >>> code is pretty horribly inefficient (asserts all over the place, and many
> >>> layers of pointer indirection to do a simple append).  I think we need
> >>> to do two things:
> >>>
> >>> 1) decode path: optimize the iterator class so that it has a const
> >>> char *current and const char *current_end that point into the current
> >>> buffer::ptr.  This way any decode will have a single pointer
> >>> add+comparison to ensure there is enough data to copy before falling
> >>> into the slow path (partial buffer, move to next buffer, etc.).
> >>>
> >>
> >> I don't have a good sense yet for how much this is hurting us in the read
> >> path.  We screwed something up in the last couple of weeks and small
> >> reads are quite slow.
> >>
> >>> 2) Having that comparison is still not ideal, but we should consider
> >>> ways to get around that too.  For example, if we know that we are
> >>> going to decode N M-byte things, we could do an iterator 'reserve' or
> >>> 'check' that ensures we have a valid pointer for that much and then
> >>> proceed without checks.  The interface here would be tricky, though,
> >>> since in the slow case we'll span buffers and need to magically fall
> >>> back to a different decode path (hard to maintain) or do a temporary
> >>> copy (probably faster but we need to ensure the iterator owns it and
> >>> frees it later).  I'd say this is step 2 and optional; step 1 will have
> >>> the most benefit.
> >>>
> >>> 3) encode path: currently all encode methods take a bufferlist& and
> >>> use the bufferlist itself as an append buffer.  I think this is flawed and
> >>> limiting.  Instead, we should make a new class called
> >>> buffer::list::appender (or similar) and templatize the encode methods
> >>> so they can take a safe_appender (which does bounds checking) or an
> >>> unsafe_appender (which does not).  For the latter, the user takes
> >>> responsibility for making sure there is enough space by doing a
> >>> reserve() type call which returns an unsafe_appender, and it's their
> >>> job to make sure they don't shove too much data into it.  That should
> >>> make the encode path a memcpy + ptr increment (for savvy/optimized
> >>> callers).
> >>
> >> Seems reasonable and similar in performance to what Piotr and I were
> >> discussing this morning.  As a very simple test I was thinking of doing a
> >> quick size computation and then passing that in to increase the
> >> append_buffer size when the bufferlist is created in
> >> Bluestore::_txc_write_nodes.  His idea went a bit farther to break the
> >> encapsulation, compute the fully encoded message, and dump it directly
> >> into a buffer of a computed size without the extra assert checks or
> >> bounds checking.  Obviously his idea would be faster but more work.
> >>
> >> It sounds like your solution would be similar but a bit more formalized.
> >>
> >>>
> >>> I suggest we use bluestore as a test case to make the interfaces work
> >>> and be fast.  If we succeed we can take advantage of it across the
> >>> rest of the code base as well.
> >>
> >> Do we have other places in the code with similar byte append behavior?
> >> That's what's really killing us I think, especially with how small the new
> >> append_buffer is when you run out of space when appending bytes.
> >>
> >>>
> >>> That's my thinking, at least.  I haven't had time to prototype it out
> >>> yet, but I think our goal should be to make the encode/decode paths
> >>> capable of being a memcpy + ptr addition in the fast path, and let
> >>> that guide the interface...
> >>>
> >>> sage
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> in the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>

^ permalink raw reply	[flat|nested] 39+ messages in thread
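
As an illustration of the appender idea in Sage's point 3 quoted above, here is a minimal sketch. It is not Ceph code: safe_appender, unsafe_appender, encode_u32 and the two encode_all_* helpers are hypothetical names, and a std::vector<char> stands in for bufferlist. It only aims to show that one templated encoder can run over either a checked appender or an unchecked one that reduces to memcpy plus a pointer increment once the caller has reserved space.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for the appenders described in the thread; these are
// not Ceph's bufferlist classes.

// Checked path: the backing vector grows as needed on every append.
struct safe_appender {
  std::vector<char>& out;
  void append(const void* src, std::size_t len) {
    const char* c = static_cast<const char*>(src);
    out.insert(out.end(), c, c + len);
  }
};

// Unchecked path: writes through a raw pointer with no bounds checks. The
// caller must have reserved enough space before the first append.
struct unsafe_appender {
  char* pos;
  void append(const void* src, std::size_t len) {
    std::memcpy(pos, src, len);
    pos += len;
  }
};

// One templated encoder serves both appender types.
template <typename Appender>
void encode_u32(Appender& app, std::uint32_t v) {
  app.append(&v, sizeof(v));
}

// Size the buffer once (the "reserve" step), then encode without checks.
inline std::vector<char> encode_all_unsafe(const std::vector<std::uint32_t>& vals) {
  std::vector<char> buf(vals.size() * sizeof(std::uint32_t));
  unsafe_appender app{buf.data()};
  for (std::uint32_t v : vals) encode_u32(app, v);
  return buf;
}

// Same encoding through the checked appender, for comparison.
inline std::vector<char> encode_all_safe(const std::vector<std::uint32_t>& vals) {
  std::vector<char> buf;
  safe_appender app{buf};
  for (std::uint32_t v : vals) encode_u32(app, v);
  return buf;
}
```

The unchecked path is only safe because the buffer is sized before any append; that is the reserve() responsibility the proposal hands to the caller.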

* Re: bluestore onode diet and encoding overhead
  2016-07-14 16:20           ` Allen Samuels
@ 2016-07-14 16:31             ` Mark Nelson
  2016-07-14 16:34               ` Allen Samuels
  0 siblings, 1 reply; 39+ messages in thread
From: Mark Nelson @ 2016-07-14 16:31 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel

So right now I'm knee deep in bisecting bluestore to track down our read 
regression, but if you want to throw together a PR that uses this for 
encode in bluestore I'd certainly be happy to give it a whirl on our 
test cluster.

Mark

On 07/14/2016 11:20 AM, Allen Samuels wrote:
> BTW, I see this stuff as gradually replacing the existing encode/decode
> infrastructure. It's pretty easy to have them side-by-side as well as have
> the new infrastructure be wire-compatible with the current stuff. That'll
> allow a slow conversion from the old-style to the new-style.  The only thing
> that's really different between the two (on the wire) is that I proposed the
> new stuff to have a length prefix so that the decoder knows how much data to
> "straighten" before launching the fast decode (this is the equivalent of the
> ESTIMATE phase during encode). For old-style stuff that doesn't have the
> prefix, you'll have to "straighten" the entire remainder of the buffer -- this
> may limit the rate of conversion (in that you can only afford to convert code
> to the new style when you know that the overhead of straightening is
> affordable -- probably because you know that there's not much data present
> OR you assume that you're in a temporary transient environment).
>
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Thursday, July 14, 2016 4:16 AM
>> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
>> <sweil@redhat.com>
>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: Re: bluestore onode diet and encoding overhead
>>
>> On 07/14/2016 12:52 AM, Allen Samuels wrote:
>>> As promised, here's some code that hacks out a new encode/decode
>>> framework. That has the advantage of only having to list the fields of a
>>> struct once and is pretty much guaranteed to never overrun a buffer....
>>>
>>> Comments are requested :)
>>
>> It compiles! :D
>>
>> I looked over the code, but I want to look it over again after I've had my
>> coffee since I'm still shaking the cobwebs out.  Would the idea here be that if
>> you are doing varint encoding for example that you always allocate the
>> buffer based on ESTIMATE (also taking into account the encoding overhead),
>> but typically expect a much smaller encoding?
>>
>> As it is, it's very clever.
>>
>> Mark
>>
>>>
>>>
>>> #include <iostream>
>>> #include <fstream>
>>> #include <set>
>>> #include <string>
>>> #include <string.h>
>>>
>>> /*******************************************************
>>>
>>>
>>>    New fast encode/decode framework.
>>>
>>>    The entire framework is built around the idea that each object has three
>>>    operations:
>>>
>>>      ESTIMATE  -- worst-case estimate of the amount of storage required for
>>>                   this object
>>>      ENCODE    -- encode object into buffer of size ESTIMATE
>>>      DECODE    -- decode object from buffer of size actual.
>>>
>>>    Each object has a single templated function that actually provides all three
>>>    operations in a single set of code.
>>>    By doing this, it's pretty much guaranteed that the ESTIMATE and the
>>>    ENCODE code are in harmony (i.e. that the estimate is correct);
>>>    it also saves a lot of typing/reading...
>>>
>>>    Generally, all three operations are provided on a single function name
>>>    with the input and return parameters overloaded to distinguish them.
>>>
>>>    It's observed that for each of the three operations there is a single value
>>>    which needs to be transmitted between each of the micro-encode/decode
>>>    calls.
>>>    Yes, this is confusing, but let's look at a simple example
>>>
>>>     struct simple {
>>>       int a;
>>>       float b;
>>>       string c;
>>>       set<int> d;
>>>     };
>>>
>>>     To encode this struct we generate a function that does the
>>>     micro-encoding of each of the fields of the struct.
>>>     Here's an example of a function that does the ESTIMATE operation.
>>>
>>>     size_t simple::estimate() {
>>>        return
>>>           sizeof(a) +
>>>           sizeof(b) +
>>>           c.size() +
>>>           d.size() * sizeof(int);
>>>     }
>>>
>>>     We're going to re-write it as:
>>>
>>>     size_t simple::estimate(size_t p) {
>>>        p = estimate(p,a);
>>>        p = estimate(p,b);
>>>        p = estimate(p,c);
>>>        p = estimate(p,d);
>>>        return p;
>>>     }
>>>
>>>     assuming helper functions sorta like:
>>>
>>>     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
>>>     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
>>>
>>>
>>>     similarly, the encode operation is represented as:
>>>
>>>     char * simple::encode(char *p) {
>>>        p = encode(p,a);
>>>        p = encode(p,b);
>>>        p = encode(p,c);
>>>        p = encode(p,d);
>>>        return p;
>>>     }
>>>
>>>     similarly, the decode operation is represented as:
>>>
>>>     const char * simple::decode(const char *p) {
>>>        p = decode(p,a);
>>>        p = decode(p,b);
>>>        p = decode(p,c);
>>>        p = decode(p,d);
>>>        return p;
>>>     }
>>>
>>>
>>> You can now see that it's possible to create a single function that
>>> does all three operations in a single block of code, provided that you can
>>> fiddle the input/output parameter types appropriately.
>>>
>>> In essence the pattern is
>>>
>>>     p = enc_dec(p,struct_field_1);
>>>     p = enc_dec(p,struct_field_2);
>>>     p = enc_dec(p,struct_field_3);
>>>
>>> With the type of p being set differently for each operation, i.e.,
>>>     for ESTIMATE, p = size_t
>>>     for ENCODE,   p = char *
>>>     for DECODE,   p = const char *
>>>
>>> This is the essence of how the encode/decode framework operates.
>>> Though there is some more sophistication...
>>>
>>> ----------------------
>>>
>>> We also want to allow the encode/decode machinery to be per-type and
>>> to operate
>>>
>>>
>>> *****************************************************************************/
>>>
>>> using namespace std;
>>>
>>> //
>>> // Just like the existing encode/decode machinery. The environment
>>> // provides a rich set of pre-defined encodes for primitive types and
>>> // containers
>>> //
>>>
>>> #define DEFINE_ENC_DEC_RAW(type) \
>>> inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
>>> inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
>>> inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }
>>>
>>> DEFINE_ENC_DEC_RAW(int);
>>> DEFINE_ENC_DEC_RAW(size_t);
>>>
>>> //
>>> // String encode/decode (Yea, I know size_t isn't portable -- this is
>>> // an EXAMPLE man...)
>>> //
>>> inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
>>> inline char * enc_dec(char *p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
>>> inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); }
>>>
>>> //
>>> // Let's do a container.
>>> //
>>> // One of the problems with a container is that making an accurate
>>> // estimate of the size would theoretically require that you walk the
>>> // entire container and add up the sizes of each element.
>>> // We probably don't want to do that. So here, I do a hack that just
>>> // assumes that I can fake up an individual element and multiply that
>>> // by the number of elements in a container. This hack works anytime that
>>> // the estimate function for the contained type has a fixed maximum size.
>>> // BTW, this is safe: if the contained type has a variable size
>>> // (like set<string>) then it will fault out the first time you run it.
>>> //
>>> // Naturally, something like set<string> or map<string,string> is a
>>> // highly desirable thing to be able to encode/decode; there's no reason
>>> // that you can't create an enc_dec_slow function that properly computes
>>> // the maximum size by walking the container.
>>> //
>>> template<typename t>
>>> inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
>>>
>>> template<typename t>
>>> inline char *enc_dec(char *p,set<t>& s) {
>>>    size_t sz = s.size();
>>>    p = enc_dec(p,sz);
>>>    for (const t& e : s) {
>>>       p = enc_dec(p,const_cast<t&>(e));
>>>    }
>>>    return p;
>>> }
>>>
>>> template<typename t>
>>> inline const char *enc_dec(const char *p,set<t>&s) {
>>>    size_t sz;
>>>    p = enc_dec(p,sz);
>>>    while (sz--) {
>>>       t temp;
>>>       p = enc_dec(p,temp);
>>>       s.insert(temp);
>>>    }
>>>    return p;
>>> }
>>>
>>> //
>>> // Specialized encode/decode for a single data type. These are invoked
>>> // explicitly...
>>> //
>>> inline size_t enc_dec_lba(size_t p,int& lba) {
>>>    return p + sizeof(lba); // Max....
>>> }
>>>
>>> inline char * enc_dec_lba(char *p,int& lba) {
>>>    *p = (char) lba; // toy 1-byte encoding: only round-trips values that fit in a char
>>>    return p + 1;
>>> }
>>>
>>> inline const char *enc_dec_lba(const char *p,int& lba) {
>>>    lba = *p;
>>>    return p+1;
>>> }
>>>
>>> //
>>> // Specialized encode/decode for more sophisticated primitives.
>>> //
>>> // Here's an example of an encoder/decoder for a pair of fields
>>> //
>>> inline size_t enc_dec_range(size_t p,short& start,short& end) {
>>>    return p + 2 * sizeof(short);
>>> }
>>>
>>> inline char *enc_dec_range(char *p, short& start, short& end) {
>>>    short *s = (short *) p;
>>>    s[0] = start;
>>>    s[1] = end;
>>>    return p + sizeof(short) * 2;
>>> }
>>>
>>> inline const char *enc_dec_range(const char *p,short& start, short& end) {
>>>    start = *(short *)p;
>>>    end   = *(short *)(p + sizeof(short));
>>>    return p + 2*sizeof(short);
>>> }
>>>
>>>
>>> //
>>> // Some C++ template wizardry to make the single encode/decode
>>> // function possible.
>>> //
>>> enum SERIAL_TYPE {
>>>    ESTIMATE,
>>>    ENCODE,
>>>    DECODE
>>> };
>>>
>>> template <enum SERIAL_TYPE s> struct serial_type;
>>>
>>> template<> struct serial_type<ESTIMATE> { typedef size_t type; };
>>> template<> struct serial_type<ENCODE>   { typedef char * type; };
>>> template<> struct serial_type<DECODE>   { typedef const char *type; };
>>>
>>> //
>>> // This macro is the key, it connects the external non-member function to
>>> // the correct member function.
>>> //
>>> #define DEFINE_STRUCT_ENC_DEC(s) \
>>> inline size_t      enc_dec(size_t p, s &o)      { return o.enc_dec<ESTIMATE>(p); } \
>>> inline char *      enc_dec(char *p, s &o)       { return o.enc_dec<ENCODE>(p); } \
>>> inline const char *enc_dec(const char *p, s &o) { return o.enc_dec<DECODE>(p); }
>>>
>>> //
>>> // Our example structure
>>> //
>>> struct astruct {
>>>    int a;
>>>    set<int> b;
>>>    int lba;
>>>    short start,end;
>>>
>>>    //
>>>    // <<<<< You need to provide this function just once.
>>>    //
>>>    template<enum SERIAL_TYPE s> typename serial_type<s>::type
>>>    enc_dec(typename serial_type<s>::type p) {
>>>       p = ::enc_dec(p,a);
>>>       p = ::enc_dec(p,b);
>>>       p = ::enc_dec_lba(p,lba);
>>>       p = ::enc_dec_range(p,start,end);
>>>       return p;
>>>    }
>>> };
>>>
>>> //
>>> // This macro connects the global enc_dec to the member function.
>>> // One of these per struct declaration
>>> //
>>> DEFINE_STRUCT_ENC_DEC(astruct);
>>>
>>>
>>> //
>>> // Here's a simple test program. The real encode/decode framework
>>> // needs to be connected to bufferlist using the pseudo-code
>>> // that I documented in my previous email.
>>> //
>>>
>>> int main(int argc,char **argv) {
>>>
>>>    astruct a;
>>>    a.a = 10;
>>>    a.b.insert(2);
>>>    a.b.insert(3);
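>>>    a.lba = 12;
>>>    a.start = 0; // initialize the range so ENCODE doesn't read indeterminate values
>>>    a.end = 0;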
>>>
>>>    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
>>>    cout << "Estimated size is " << s << "\n";
>>>
>>>    char buffer[100];
>>>
>>>    char *end = a.enc_dec<ENCODE>(buffer);
>>>
>>>    cout << "Actual storage was " << end-buffer << "\n";
>>>
>>>    astruct b;
>>>
>>>    (void) b.enc_dec<DECODE>(buffer); // decode it
>>>
>>>    cout << "A.a = " << b.a << "\n";
>>>    for (auto e : b.b) {
>>>       cout << " " << e;
>>>    }
>>>
>>>    cout << "\n";
>>>
>>>    cout << "a.lba = " << b.lba << "\n";
>>>
>>>    return 0;
>>> }
>>>
>>>
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>
>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>> Sent: Tuesday, July 12, 2016 8:13 PM
>>>> To: Sage Weil <sweil@redhat.com>; Allen Samuels
>>>> <Allen.Samuels@sandisk.com>
>>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>>>> Subject: Re: bluestore onode diet and encoding overhead
>>>>
>>>>
>>>>
>>>> On 07/12/2016 08:50 PM, Sage Weil wrote:
>>>>> On Tue, 12 Jul 2016, Allen Samuels wrote:
>>>>>> Good analysis.
>>>>>>
>>>>>> My original comments about putting the oNode on a diet included the
>>>>>> idea of a "custom" encode/decode path for certain high-usage cases.
>>>>>> At the time, Sage resisted going down that path hoping that a more
>>>>>> optimized generic case would get the job done. Your analysis shows
>>>>>> that while we've achieved significant space reduction this has come
>>>>>> at the expense of CPU time -- which dominates small object
>>>>>> performance (I suspect that eventually we'd discover that the
>>>>>> variable length decode path would be responsible for a substantial
>>>>>> read performance degradation also -- which may or may not be part of
>>>>>> the read performance drop-off that you're seeing). This isn't a
>>>>>> surprising result, though it is unfortunate.
>>>>>>
>>>>>> I believe we need to revisit the idea of custom encode/decode paths
>>>>>> for high-usage cases, only now the gains need to be focused on CPU
>>>>>> utilization as well as space efficiency.
>>>>>
>>>>> I still think we can get most or all of the way there in a generic way
>>>>> by revising the way that we interact with bufferlist for encode and
>>>>> decode.  We haven't actually tried to optimize this yet, and the current
>>>>> code is pretty horribly inefficient (asserts all over the place, and many
>>>>> layers of pointer indirection to do a simple append).  I think we need
>>>>> to do two things:
>>>>>
>>>>> 1) decode path: optimize the iterator class so that it has a const
>>>>> char *current and const char *current_end that point into the current
>>>>> buffer::ptr.  This way any decode will have a single pointer
>>>>> add+comparison to ensure there is enough data to copy before falling
>>>>> into the slow path (partial buffer, move to next buffer, etc.).
>>>>>
>>>>
>>>> I don't have a good sense yet for how much this is hurting us in the read
>>>> path.  We screwed something up in the last couple of weeks and small
>>>> reads are quite slow.
>>>>
>>>>> 2) Having that comparison is still not ideal, but we should consider
>>>>> ways to get around that too.  For example, if we know that we are
>>>>> going to decode N M-byte things, we could do an iterator 'reserve' or
>>>>> 'check' that ensures we have a valid pointer for that much and then
>>>>> proceed without checks.  The interface here would be tricky, though,
>>>>> since in the slow case we'll span buffers and need to magically fall
>>>>> back to a different decode path (hard to maintain) or do a temporary
>>>>> copy (probably faster but we need to ensure the iterator owns it and
>>>>> frees it later).  I'd say this is step 2 and optional; step 1 will have
>>>>> the most benefit.
>>>>>
>>>>> 3) encode path: currently all encode methods take a bufferlist& and
>>>>> use the bufferlist itself as an append buffer.  I think this is flawed and
>>>>> limiting.  Instead, we should make a new class called
>>>>> buffer::list::appender (or similar) and templatize the encode methods
>>>>> so they can take a safe_appender (which does bounds checking) or an
>>>>> unsafe_appender (which does not).  For the latter, the user takes
>>>>> responsibility for making sure there is enough space by doing a
>>>>> reserve() type call which returns an unsafe_appender, and it's their
>>>>> job to make sure they don't shove too much data into it.  That should
>>>>> make the encode path a memcpy + ptr increment (for savvy/optimized
>>>>> callers).
>>>>
>>>> Seems reasonable and similar in performance to what Piotr and I were
>>>> discussing this morning.  As a very simple test I was thinking of doing a
>>>> quick size computation and then passing that in to increase the
>>>> append_buffer size when the bufferlist is created in
>>>> Bluestore::_txc_write_nodes.  His idea went a bit farther to break the
>>>> encapsulation, compute the fully encoded message, and dump it directly
>>>> into a buffer of a computed size without the extra assert checks or
>>>> bounds checking.  Obviously his idea would be faster but more work.
>>>>
>>>> It sounds like your solution would be similar but a bit more formalized.
>>>>
>>>>>
>>>>> I suggest we use bluestore as a test case to make the interfaces work
>>>>> and be fast.  If we succeed we can take advantage of it across the
>>>>> rest of the code base as well.
>>>>
>>>> Do we have other places in the code with similar byte append behavior?
>>>> That's what's really killing us I think, especially with how small the new
>>>> append_buffer is when you run out of space when appending bytes.
>>>>
>>>>>
>>>>> That's my thinking, at least.  I haven't had time to prototype it out
>>>>> yet, but I think our goal should be to make the encode/decode paths
>>>>> capable of being a memcpy + ptr addition in the fast path, and let
>>>>> that guide the interface...
>>>>>
>>>>> sage
>>>>>

^ permalink raw reply	[flat|nested] 39+ messages in thread
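
Mark's varint question in the message quoted above (allocate from ESTIMATE, usually use less) can be illustrated with a small sketch in the same three-overload style as enc_dec_lba. The 7-bits-per-byte format and the names enc_dec_varint and varint_round_trip are assumptions made for illustration; they are not part of Allen's posted code or of Ceph's wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical varint helpers; the format (7 payload bits per byte, high bit
// set while more bytes follow) is an assumption for this sketch.

// ESTIMATE: a 32-bit value needs at most 5 bytes at 7 payload bits per byte.
inline std::size_t enc_dec_varint(std::size_t p, std::uint32_t&) {
  return p + 5;
}

// ENCODE: emit the low 7 bits first, flagging continuation in the high bit.
inline char* enc_dec_varint(char* p, std::uint32_t& v) {
  std::uint32_t x = v;
  while (x >= 0x80) {
    *p++ = static_cast<char>((x & 0x7f) | 0x80);
    x >>= 7;
  }
  *p++ = static_cast<char>(x);
  return p;
}

// DECODE: accumulate 7 bits per byte until a byte without the high bit.
inline const char* enc_dec_varint(const char* p, std::uint32_t& v) {
  v = 0;
  int shift = 0;
  unsigned char b;
  do {
    b = static_cast<unsigned char>(*p++);
    v |= static_cast<std::uint32_t>(b & 0x7f) << shift;
    shift += 7;
  } while (b & 0x80);
  return p;
}

// Round trip: size the buffer for the worst case, then report how many bytes
// the encoder actually used.
inline std::size_t varint_round_trip(std::uint32_t in, std::uint32_t& out) {
  char buffer[8];
  assert(enc_dec_varint(std::size_t(0), in) <= sizeof(buffer));
  char* end = enc_dec_varint(buffer, in);
  const char* dend = enc_dec_varint(static_cast<const char*>(buffer), out);
  assert(dend == end);
  (void) dend;
  return static_cast<std::size_t>(end - buffer);
}
```

Under these assumptions the estimate for a 32-bit value is always 5 bytes, while a small value such as 300 encodes in 2, so the buffer is sized for the worst case and the returned pointer tells the caller how much was actually used.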

* RE: bluestore onode diet and encoding overhead
  2016-07-14 16:31             ` Mark Nelson
@ 2016-07-14 16:34               ` Allen Samuels
  0 siblings, 0 replies; 39+ messages in thread
From: Allen Samuels @ 2016-07-14 16:34 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil; +Cc: ceph-devel

Ideally, yes. Either it happens tomorrow or two weeks from now. 


Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Thursday, July 14, 2016 9:32 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sweil@redhat.com>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> So right now I'm knee deep in bisecting bluestore to track down our read
> regression, but if you want to throw together a PR that uses this for encode
> in bluestore I'd certainly be happy to give it a whirl on our test cluster.
> 
> Mark
> 
> On 07/14/2016 11:20 AM, Allen Samuels wrote:
> > BTW, I see this stuff as gradually replacing the existing encode/decode
> > infrastructure. It's pretty easy to have them side-by-side as well as have
> > the new infrastructure be wire-compatible with the current stuff. That'll
> > allow a slow conversion from the old-style to the new-style.  The only
> > thing that's really different between the two (on the wire) is that I
> > proposed the new stuff to have a length prefix so that the decoder knows
> > how much data to "straighten" before launching the fast decode (this is
> > the equivalent of the ESTIMATE phase during encode). For old-style stuff
> > that doesn't have the prefix, you'll have to "straighten" the entire
> > remainder of the buffer -- this may limit the rate of conversion (in that
> > you can only afford to convert code to the new style when you know that
> > the overhead of straightening is affordable -- probably because you know
> > that there's not much data present OR you assume that you're in a
> > temporary transient environment).
> >
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >> -----Original Message-----
> >> From: Mark Nelson [mailto:mnelson@redhat.com]
> >> Sent: Thursday, July 14, 2016 4:16 AM
> >> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> >> <sweil@redhat.com>
> >> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> >> Subject: Re: bluestore onode diet and encoding overhead
> >>
> >> On 07/14/2016 12:52 AM, Allen Samuels wrote:
> >>> As promised, here's some code that hacks out a new encode/decode
> >>> framework. That has the advantage of only having to list the fields
> >>> of a struct once and is pretty much guaranteed to never overrun a
> >>> buffer....
> >>>
> >>> Comments are requested :)
> >>
> >> It compiles! :D
> >>
> >> I looked over the code, but I want to look it over again after I've
> >> had my coffee since I'm still shaking the cobwebs out.  Would the
> >> idea here be that if you are doing varint encoding for example that
> >> you always allocate the buffer based on ESTIMATE (also taking into
> >> account the encoding overhead), but typically expect a much smaller
> >> encoding?
> >>
> >> As it is, it's very clever.
> >>
> >> Mark
> >>
> >>>
> >>>
> >>> #include <iostream>
> >>> #include <fstream>
> >>> #include <set>
> >>> #include <string>
> >>> #include <string.h>
> >>>
> >>>
> >>> /*******************************************************
> >>>
> >>>
> >>>    New fast encode/decode framework.
> >>>
> >>>    The entire framework is built around the idea that each object
> >>>    has three operations:
> >>>
> >>>      ESTIMATE  -- worst-case estimate of the amount of storage
> >>>                   required for this object
> >>>      ENCODE    -- encode object into buffer of size ESTIMATE
> >>>      DECODE    -- decode object from buffer of size actual.
> >>>
> >>>    Each object has a single templated function that actually
> >>>    provides all three operations in a single set of code.
> >>>    By doing this, it's pretty much guaranteed that the ESTIMATE and
> >>>    the ENCODE code are in harmony (i.e. that the estimate is correct);
> >>>    it also saves a lot of typing/reading...
> >>>
> >>>    Generally, all three operations are provided on a single function
> >>>    name with the input and return parameters overloaded to distinguish
> >>>    them.
> >>>
> >>>    It's observed that for each of the three operations there is a
> >>>    single value which needs to be transmitted between each of the
> >>>    micro-encode/decode calls.
> >>>    Yes, this is confusing, but let's look at a simple example
> >>>
> >>>     struct simple {
> >>>       int a;
> >>>       float b;
> >>>       string c;
> >>>       set<int> d;
> >>>     };
> >>>
> >>>     To encode this struct we generate a function that does the
> >>>     micro-encoding of each of the fields of the struct.
> >>>     Here's an example of a function that does the ESTIMATE operation.
> >>>
> >>>     size_t simple::estimate() {
> >>>        return
> >>>           sizeof(a) +
> >>>           sizeof(b) +
> >>>           c.size() +
> >>>           d.size() * sizeof(int);
> >>>     }
> >>>
> >>>     We're going to re-write it as:
> >>>
> >>>     size_t simple::estimate(size_t p) {
> >>>        p = estimate(p,a);
> >>>        p = estimate(p,b);
> >>>        p = estimate(p,c);
> >>>        p = estimate(p,d);
> >>>        return p;
> >>>     }
> >>>
> >>>     assuming helper functions sorta like:
> >>>
> >>>     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
> >>>     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
> >>>
> >>>
> >>>     similarly, the encode operation is represented as:
> >>>
> >>>     char * simple::encode(char *p) {
> >>>        p = encode(p,a);
> >>>        p = encode(p,b);
> >>>        p = encode(p,c);
> >>>        p = encode(p,d);
> >>>        return p;
> >>>     }
> >>>
> >>>     similarly, the decode operation is represented as:
> >>>
> >>>     const char * simple::decode(const char *p) {
> >>>        p = decode(p,a);
> >>>        p = decode(p,b);
> >>>        p = decode(p,c);
> >>>        p = decode(p,d);
> >>>        return p;
> >>>     }
> >>>
> >>>
> >>> You can now see that it's possible to create a single function that
> >>> does all three operations in a single block of code, provided that
> >>> you can
> >> fiddle the input/output parameter types appropriately.
> >>>
> >>> In essence the pattern is
> >>>
> >>>     p = enc_dec(p,struct_field_1);
> >>>     p = enc_dec(p,struct_field_2);
> >>>     p = enc_dec(p,struct_field_3);
> >>>
> >>> With the type of p being set differently for each operation, i.e.,
> >>>     for ESTIMATE, p = size_t
> >>>     for ENCODE,   p = char *
> >>>     for DECODE,   p = const char *
> >>>
> >>> This is the essence of how the encode/decode framework operates.
> >> Though there is some more sophistication...
> >>>
> >>> ----------------------
> >>>
> >>> We also want to allow the encode/decode machinery to be per-type and
> >>> to operate
> >>>
> >>>
> >>
> **********************************************************
> >> ************
> >>> *******/
> >>>
> >>> using namespace std;
> >>>
> >>> //
> >>> // Just like the existing encode/decode machinery. The environment
> >>> provides a rich set of // pre-defined encodes for primitive types
> >>> and containers //
> >>>
> >>> #define DEFINE_ENC_DEC_RAW(type) \
> >>> inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> >>> inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p +
> >> sizeof(type); } \
> >>> inline const char *enc_dec(const char *p,type &o) { o = *(const type
> >>> *)p; return p + sizeof(type); }
> >>>
> >>> DEFINE_ENC_DEC_RAW(int);
> >>> DEFINE_ENC_DEC_RAW(size_t);
> >>>
> >>> //
> >>> // String encode/decode (Yea, I know size_t isn't portable -- this
> >>> // is an EXAMPLE man...)
> >>> //
> >>> inline size_t enc_dec(size_t p,string& s) {
> >>>    return p + sizeof(size_t) + s.size();
> >>> }
> >>> inline char * enc_dec(char *p,string& s) {
> >>>    *(size_t *)p = s.size();
> >>>    memcpy(p+sizeof(size_t),s.c_str(),s.size());
> >>>    return p + sizeof(size_t) + s.size();
> >>> }
> >>> inline const char *enc_dec(const char *p,string& s) {
> >>>    s = string(p + sizeof(size_t),*(size_t *)p);
> >>>    return p + sizeof(size_t) + s.size();
> >>> }
> >>>
> >>> //
> >>> // Let's do a container.
> >>> //
> >>> // One of the problems with a container is that making an accurate
> >>> estimate of the size // would theoretically require that you walk
> >>> the entire
> >> container and add up the sizes of each element.
> >>> // We probably don't want to do that. So here, I do a hack that just
> >>> assumes that I can fake up a individual element // and multiple that
> >>> by the number of elements in a container. This hack works anytime
> >>> that the estimate function // for the contained type has a fixed
> maximum size.
> >> BTW, this is safe, if the contained type has a variable size //
> >> (like set<string>) then it will fault out the first time you run it.
> >>> //
> >>> // Naturally, something like set<string> or map<string,string> is a
> >>> highly desirable thing to be able to encode/decode // there's no
> >>> reason
> >> that you can't create a enc_dec_slow function that properly computes
> >> the maximum size by walking the container.
> >>> //
> >>> template<typename t>
> >>> inline size_t enc_dec(size_t p,set<t>& s) { return p +
> >>> sizeof(size_t)
> >>> + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
> >>>
> >>> template<typename t>
> >>> inline char *enc_dec(char *p,set<t>& s) {
> >>>    size_t sz = s.size();
> >>>    p = enc_dec(p,sz);
> >>>    for (const t& e : s) {
> >>>       p = enc_dec(p,const_cast<t&>(e));
> >>>    }
> >>>    return p;
> >>> }
> >>>
> >>> template<typename t>
> >>> inline const char *enc_dec(const char *p,set<t>&s) {
> >>>    size_t sz;
> >>>    p = enc_dec(p,sz);
> >>>    while (sz--) {
> >>>       t temp;
> >>>       p = enc_dec(p,temp);
> >>>       s.insert(temp);
> >>>    }
> >>>    return p;
> >>> }
> >>>
> >>> //
> >>> // Specialized encode/decode for a single data type. These are
> >>> invoked
> >> explicitly...
> >>> //
> >>> inline size_t enc_dec_lba(size_t p,int& lba) {
> >>>    return p + sizeof(lba); // Max....
> >>> }
> >>>
> >>> inline char * enc_dec_lba(char *p,int& lba) {
> >>>    *p = 15;
> >>>    return p + 1; // blah blah
> >>> }
> >>>
> >>> inline const char *enc_dec_lba(const char *p,int& lba) {
> >>>    lba = *p;
> >>>    return p+1;
> >>> }
> >>>
> >>> //
> >>> // Specialized encode/decode for more sophisticated things primitives.
> >>> //
> >>> // Here's an example of a encode/decoder for a pair of fields //
> >>> inline size_t enc_dec_range(size_t p,short& start,short& end) {
> >>>    return p + 2 * sizeof(short);
> >>> }
> >>>
> >>> inline char *enc_dec_range(char *p, short& start, short& end) {
> >>>    short *s = (short *) p;
> >>>    s[0] = start;
> >>>    s[1] = end;
> >>>    return p + sizeof(short) * 2;
> >>> }
> >>>
> >>> inline const char *enc_dec_range(const char *p,short& start, short&
> end) {
> >>>    start = *(short *)p;
> >>>    end   = *(short *)(p + sizeof(short));
> >>>    return p + 2*sizeof(short);
> >>> }
> >>>
> >>>
> >>> //
> >>> // Some C++ template wizardry to make the single encode/decode
> >> function possible.
> >>> //
> >>> enum SERIAL_TYPE {
> >>>    ESTIMATE,
> >>>    ENCODE,
> >>>    DECODE
> >>> };
> >>>
> >>> template <enum SERIAL_TYPE s> struct serial_type;
> >>>
> >>> template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> >>> template<> struct serial_type<ENCODE>   { typedef char * type; };
> >>> template<> struct serial_type<DECODE>   { typedef const char *type; };
> >>>
> >>> //
> >>> // This macro is the key, it connects the external non-member
> >>> function to
> >> the correct member function.
> >>> //
> >>> #define DEFINE_STRUCT_ENC_DEC(s) \
> >>> inline size_t      enc_dec(size_t p, s &o) { return
> o.enc_dec<ESTIMATE>(p); }
> >> \
> >>> inline char *      enc_dec(char *p , s &o)  { return
> o.enc_dec<ENCODE>(p); }
> >> \
> >>> inline const char *enc_dec(const char *p,s &o)  { return
> >>> o.enc_dec<DECODE>(p); }
> >>>
> >>> //
> >>> // Our example structure
> >>> //
> >>> struct astruct {
> >>>    int a;
> >>>    set<int> b;
> >>>    int lba;
> >>>    short start,end;
> >>>
> >>>    //
> >>>    // <<<<< You need to provide this function just once.
> >>>    //
> >>>    template<enum SERIAL_TYPE s> typename serial_type<s>::type
> >> enc_dec(typename serial_type<s>::type p) {
> >>>       p = ::enc_dec(p,a);
> >>>       p = ::enc_dec(p,b);
> >>>       p = ::enc_dec_lba(p,lba);
> >>>       p = ::enc_dec_range(p,start,end);
> >>>       return p;
> >>>    }
> >>> };
> >>>
> >>> //
> >>> // This macro connects the global enc_dec to the member function.
> >>> // One of these per struct declaration //
> >>> DEFINE_STRUCT_ENC_DEC(astruct);
> >>>
> >>>
> >>> //
> >>> // Here's a simple test program. The real encode/decode framework
> >>> needs to be connected to bufferlist using the pseudo-code // that I
> >> documented in my previous email.
> >>> //
> >>>
> >>> int main(int argc,char **argv) {
> >>>
> >>>    astruct a;
> >>>    a.a = 10;
> >>>    a.b.insert(2);
> >>>    a.b.insert(3);
> >>>    a.lba = 12;
> >>>
> >>>    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> >>>    cout << "Estimated size is " << s << "\n";
> >>>
> >>>    char buffer[100];
> >>>
> >>>    char *end = a.enc_dec<ENCODE>(buffer);
> >>>
> >>>    cout << "Actual storage was " << end-buffer << "\n";
> >>>
> >>>    astruct b;
> >>>
> >>>    (void) b.enc_dec<DECODE>(buffer); // decode it
> >>>
> >>>    cout << "A.a = " << b.a << "\n";
> >>>    for (auto e : b.b) {
> >>>       cout << " " << e;
> >>>    }
> >>>
> >>>    cout << "\n";
> >>>
> >>>    cout << "a.lba = " << b.lba << "\n";
> >>>
> >>>    return 0;
> >>> }
> >>>
> >>>
> >>> Allen Samuels
> >>> SanDisk |a Western Digital brand
> >>> 2880 Junction Avenue, San Jose, CA 95134
> >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Mark Nelson [mailto:mnelson@redhat.com]
> >>>> Sent: Tuesday, July 12, 2016 8:13 PM
> >>>> To: Sage Weil <sweil@redhat.com>; Allen Samuels
> >>>> <Allen.Samuels@sandisk.com>
> >>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> >>>> Subject: Re: bluestore onode diet and encoding overhead
> >>>>
> >>>>
> >>>>
> >>>> On 07/12/2016 08:50 PM, Sage Weil wrote:
> >>>>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> >>>>>> Good analysis.
> >>>>>>
> >>>>>> My original comments about putting the oNode on a diet included
> >>>>>> the idea of a "custom" encode/decode path for certain high-usage
> cases.
> >>>>>> At the time, Sage resisted going down that path hoping that a
> >>>>>> more optimized generic case would get the job done. Your analysis
> >>>>>> shows that while we've achieved significant space reduction this
> >>>>>> has come at the expense of CPU time -- which dominates small
> >>>>>> object performance (I suspect that eventually we'd discover that
> >>>>>> the variable length decode path would be responsible for a
> >>>>>> substantial read performance degradation also -- which may or may
> >>>>>> not be part of the read performance drop-off that you're seeing).
> >>>>>> This isn't a
> >> surprising
> >>>> result, though it is unfortunate.
> >>>>>>
> >>>>>> I believe we need to revisit the idea of custom encode/decode
> >>>>>> paths for high-usage cases, only now the gains need to be focused
> >>>>>> on CPU utilization as well as space efficiency.
> >>>>>
> >>>>> I still think we can get most or all of the way there in a generic
> >>>>> way by revising the way that we interact with bufferlist for
> >>>>> encode and
> >> decode.
> >>>>> We haven't actually tried to optimize this yet, and the current
> >>>>> code is pretty horribly inefficient (asserts all over the place,
> >>>>> and many layers of pointer indirection to do a simple append).  I
> >>>>> think we need to do two
> >>>>> things:
> >>>>>
> >>>>> 1) decode path: optimize the iterator class so that it has a const
> >>>>> char *current and const char *current_end that point into the
> >>>>> current buffer::ptr.  This way any decode will have a single
> >>>>> pointer add+comparison to ensure there is enough data to copy
> >>>>> before falling into the slow path (partial buffer, move to next
> >>>>> buffer, etc.).
> >>>>>
> >>>>
> >>>> I don't have a good sense yet for how much this is hurting us in
> >>>> the read path.  We screwed something up in the last couple of weeks
> >>>> and small
> >> reads
> >>>> are quite slow.
> >>>>
> >>>>> 2) Having that comparison is still not ideal, but we should
> >>>>> consider ways to get around that too.  For example, if we know
> >>>>> that we are going to decode N M-byte things, we could do an
> >>>>> iterator 'reserve' or 'check' that ensures we have a valid pointer
> >>>>> for that much and then proceed without checks.  The interface here
> >>>>> would be tricky, though, since in the slow case we'll span buffers
> >>>>> and need to magically fall back to a different decode path (hard
> >>>>> to maintain) or do a temporary copy (probably faster but we need
> >>>>> to ensure the iterator owns it and frees it later).  I'd say this
> >>>>> is step 2 and optional; step 1 will have the most
> >>>> benefit.
> >>>>>
> >>>>> 3) encode path: currently all encode methods take a bufferlist&
> >>>>> and the bufferlist itself as an append buffer.  I think this is
> >>>>> flawed and limiting.  Instead, we should make a new class called
> >>>>> buffer::list::appender (or similar) and templatize the encode
> >>>>> methods so they can take a safe_appender (which does bounds
> >>>>> checking) or an unsafe_appender (which does not).  For the latter,
> >>>>> the user takes responsibility for making sure there is enough
> >>>>> space by doing a
> >>>>> reserve() type call which returns an unsafe_appender, and it's
> >>>>> their job to make sure they don't shove too much data into it.
> >>>>> That should make the encode path a memcpy + ptr increment (for
> >>>>> savvy/optimized
> >>>> callers).
> >>>>
> >>>> Seems reasonable and similar in performance to what Piotr and I
> >>>> were discussing this morning.  As a very simple test I was thinking
> >>>> of doing a
> >> quick
> >>>> size computation and then passing that in to increase the
> >>>> append_buffer
> >> size
> >>>> when the bufferlist is created in Bluestore::_txc_write_nodes.  His
> >>>> idea
> >> went
> >>>> a bit farther to break the encapsulation, compute the fully encoded
> >>>> message, and dump it directly into a buffer of a computed size
> >>>> without
> >> the
> >>>> extra assert checks or bounds checking.  Obviously his idea would
> >>>> be
> >> faster
> >>>> but more work.
> >>>>
> >>>> It sounds like your solution would be similar but a bit more formalized.
> >>>>
> >>>>>
> >>>>> I suggest we use bluestore as a test case to make the interfaces
> >>>>> work and be fast.  If we succeed we can take advantage of it
> >>>>> across the rest of the code base as well.
> >>>>
> >>>> Do we have other places in the code with similar byte append
> behavior?
> >>>> That's what's really killing us I think, especially with how small
> >>>> the new append_buffer is when you run out of space when appending
> bytes.
> >>>>
> >>>>>
> >>>>> That's my thinking, at least.  I haven't had time to prototype it
> >>>>> out yet, but I think our goal should be to make the encode/decode
> >>>>> paths capable of being a memcpy + ptr addition in the fast path,
> >>>>> and let that guide the interface...
> >>>>>
> >>>>> sage
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>> majordomo
> >>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: bluestore onode diet and encoding overhead
  2016-07-14 14:10           ` Allen Samuels
@ 2016-08-12 16:18             ` Sage Weil
  2016-08-12 22:25               ` Allen Samuels
  0 siblings, 1 reply; 39+ messages in thread
From: Sage Weil @ 2016-08-12 16:18 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Mark Nelson, ceph-devel

Okay, I finally had some time to read through this thread!

On Thu, 14 Jul 2016, Allen Samuels wrote:
> Yes, I did actually run the code before I posted it.
> 
> w.r.t. varint encoding. You have two choices w.r.t. a variable-length 
> encoding: you could examine the data to accurately predict the output 
> size OR you could just return a constant that represents the worst-case 
> (max) size.  For individual fields, it probably doesn't matter which you 
> choose, but for fields that are part of something in a container, you 
> probably want the option of NOT running down the container to size up 
> each element -- so you'd just choose the worst-case size for the 
> estimator.
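
To make the two choices concrete, here is a minimal sketch, assuming a
LEB128-style varint with 7 payload bits per byte. The names
varint_estimate, varint_exact_size, varint_encode and varint_decode are
hypothetical; they are not part of the posted framework or of Ceph. The
estimator returns the constant worst case without looking at the value,
while varint_exact_size shows what "examine the data" would cost.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only.
// Worst case for a 64-bit value at 7 bits per byte: ceil(64 / 7) = 10 bytes.
constexpr std::size_t VARINT64_MAX_BYTES = 10;

// ESTIMATE, choice 2: constant worst case, the value is never inspected.
inline std::size_t varint_estimate(std::size_t p, const std::uint64_t&) {
  return p + VARINT64_MAX_BYTES;
}

// ESTIMATE, choice 1: examine the data to get the exact encoded size.
inline std::size_t varint_exact_size(std::uint64_t v) {
  std::size_t n = 1;
  while (v >= 0x80) {
    v >>= 7;
    ++n;
  }
  return n;
}

// ENCODE: low 7 bits first, high bit set on every byte except the last.
inline char* varint_encode(char* p, std::uint64_t v) {
  while (v >= 0x80) {
    *p++ = static_cast<char>((v & 0x7f) | 0x80);
    v >>= 7;
  }
  *p++ = static_cast<char>(v);
  return p;
}

// DECODE: assumes well-formed input of at most 10 bytes (no bounds check).
inline const char* varint_decode(const char* p, std::uint64_t& v) {
  v = 0;
  int shift = 0;
  unsigned char byte;
  do {
    byte = static_cast<unsigned char>(*p++);
    v |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  return p;
}
```

For a container of such values the worst-case form lets the estimate be
size() * 10 with no walk over the elements, at the price of
over-allocating when most values are small.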
> 
> Though this code doesn't show it, I wrote some pseudo-code in a previous 
> e-mail that glues this framework into the bufferlist stuff. That pseudo 
> code is well prepared for estimate functions that are too large (indeed, 
> it expects that to happen) and it naturally handles buffer overrun 
> detection.
> 
> I didn't describe it in the example, but this framework very naturally 
> handles versioning, you just add some code like:
> 
> struct abc {
>    int version;
>    int a;
>    int b;
>    ..... enc_dec(p) {
>       p = ::enc_dec(p, version);
>       p = ::enc_dec(p, a);
>       // This field is present in all estimate and encode operations,
>       // but only in decode operations when version is > 5
>       if (s != DECODE || version > 5) p = ::enc_dec(p, b);
>       return p;
>    }
> };

This is pretty cool.  I'm mainly nervous about the versioning stuff.  The 
current encode/decode scheme already has a length (so we're good there), 
and also two other fields: struct_v and compat_v, indicating the version 
of the encoding and the oldest decoder that can understand it.  I don't 
see any reason why that couldn't be replicated here.  I am a bit nervous 
about the decoding side, though, since it can get complicated.  What you 
have above is the common case, but even a moderately simple one (where we 
didn't have to do anything kludgey) looks like this:

	https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc#L2337

I suspect those conditionals would end up looking like

 if (s != DECODE || version > 5) {
   p = ::enc_dec(p, b);
 } else if (s == DECODE) {
   b = default/compat value;
 }
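
A minimal compilable sketch of that conditional, assuming the
serial_type machinery from Allen's posted example. The field layout, the
version threshold and the compat value of -1 are illustrative
assumptions, and the raw int overloads use memcpy here rather than the
pointer casts in the original. It also assumes the encoder always writes
the current version, so only decode has to cope with older data.

```cpp
#include <cstddef>
#include <cstring>

enum SERIAL_TYPE { ESTIMATE, ENCODE, DECODE };

template <SERIAL_TYPE s> struct serial_type;
template <> struct serial_type<ESTIMATE> { typedef std::size_t type; };
template <> struct serial_type<ENCODE>   { typedef char* type; };
template <> struct serial_type<DECODE>   { typedef const char* type; };

// Raw int overloads, one per operation (estimate / encode / decode).
inline std::size_t enc_dec(std::size_t p, int&) { return p + sizeof(int); }
inline char* enc_dec(char* p, int& o) {
  std::memcpy(p, &o, sizeof(int));
  return p + sizeof(int);
}
inline const char* enc_dec(const char* p, int& o) {
  std::memcpy(&o, p, sizeof(int));
  return p + sizeof(int);
}

struct abc {
  int version = 6;  // current encoding version (assumed)
  int a = 0;
  int b = 0;        // field added in version 6 (assumed)

  template <SERIAL_TYPE s>
  typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
    p = ::enc_dec(p, version);
    p = ::enc_dec(p, a);
    if (s != DECODE || version > 5) {
      // Always estimated and encoded; decoded only from new-enough data.
      p = ::enc_dec(p, b);
    } else {
      // Decoding old data that has no b: fall back to a compat value.
      b = -1;
    }
    return p;
  }
};
```

Since s is a template parameter, the s != DECODE test is resolved at
compile time, so the estimate and encode instantiations carry no runtime
branch for it.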

I'm still trying to sort out in my head how this relates to the appender 
thing.  I think they're largely orthogonal, but the estimate function here 
could be used to drive the unsafe_appender stuff pretty seamlessly.  
Using the unsafe_appender manually is going to be a lot more error-prone, 
but should get the same performance benefit, without unifying the 
encode/decode stuff.

I'm a bit worried that the estimate process will be too slow, though.  On 
a complicated nested object, for example, it will have to traverse the 
full data structure once to estimate, and then again to encode.  It might 
be simpler and faster to have the outer parts of the structure operate on 
a safe_encoder, and construct an unsafe_encoder only when we are 
explicitly prepared to do the estimate.  For example, we can have a 
safe_encoder method like

  unsafe_encoder reserve(size_t s);

so that we can do

  void encode(bufferlist::safe_appender& ap) const {
    ENCODE_START(2, 2);
    // do a single range check for all of our simple members
    {
      unsafe_encoder t = ap.reserve(5 * sizeof(uint64_t));
      ::encode(foo, t);
      ::encode(bar, t);
      ::encode(baz, t);
      ::encode(a, t);
      ::encode(b, t);
    }
    // use the safe encoder for some complex ones
    ::encode(widget, ap);
    ::encode(widget2, ap);
    // explicitly estimate a simple container
    {
      unsafe_encoder t = ap.reserve(sub.size() * known_worst_case);
      ::encode(sub, t);
    }
    // dynamically range check a complex container
    ::encode(complex_container, ap);
  }

The enc_dec currently forces a full estimate in all cases, even when it's 
not really needed.  Perhaps we can come up with some set of templates 
and wrapper functions so that we can use safe and unsafe encoders somewhat 
interchangeably so that the estimate infrastructure is only triggered 
when needed?
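
One way the interchangeable part could look, as a rough sketch only: the
classes below are hypothetical stand-ins backed by a std::vector<char>,
not the bufferlist API, and the sketch assumes every reservation is
filled exactly (a worst-case reservation would also need a way to hand
back the unused tail). The point is that a single templated encode()
works against either appender, so the caller decides where the range
check happens.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical: writes with no bounds check into space reserved earlier.
class unsafe_appender {
  char* pos_;
public:
  explicit unsafe_appender(char* pos) : pos_(pos) {}
  void append(const void* src, std::size_t len) {
    std::memcpy(pos_, src, len);  // caller guaranteed the space exists
    pos_ += len;
  }
};

// Hypothetical: every append grows the buffer as needed (the checked path).
class safe_appender {
  std::vector<char>& buf_;
public:
  explicit safe_appender(std::vector<char>& buf) : buf_(buf) {}
  void append(const void* src, std::size_t len) {
    const char* c = static_cast<const char*>(src);
    buf_.insert(buf_.end(), c, c + len);
  }
  // One range check for len bytes. The returned appender is only valid
  // until the next call on this safe_appender, and the caller must write
  // exactly len bytes through it.
  unsafe_appender reserve(std::size_t len) {
    std::size_t old_size = buf_.size();
    buf_.resize(old_size + len);
    return unsafe_appender(buf_.data() + old_size);
  }
};

// A single encode template that accepts either kind of appender.
template <typename Appender>
inline void encode(std::uint64_t v, Appender& ap) {
  ap.append(&v, sizeof(v));
}
```

Used in the style of the example above, the simple members go through a
scoped reservation and everything else stays on the checked path:

  safe_appender ap(out);
  {
    unsafe_appender t = ap.reserve(2 * sizeof(uint64_t));
    encode(foo, t);
    encode(bar, t);
  }
  encode(baz, ap);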

sage



> What this framework doesn't yet handle very well is situations where you 
> have a container with a contained type that is a primitive (i.e., uint8) 
> and you want that contained type to be custom encoded. Currently, the 
> only solution is replace the contained primitive type with a class 
> wrapper. Unfortunately a typedef is NOT sufficient to differentiate it.
> 
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@redhat.com]
> > Sent: Thursday, July 14, 2016 4:16 AM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> > <sweil@redhat.com>
> > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: Re: bluestore onode diet and encoding overhead
> > 
> > On 07/14/2016 12:52 AM, Allen Samuels wrote:
> > > As promised, here's some code that hacks out a new encode/decode
> > framework. That has the advantage of only having to list the fields of a struct
> > once and is pretty much guaranteed to never overrun a buffer....
> > >
> > > Comments are requested :)
> > 
> > It compiles! :D
> > 
> > I looked over the code, but I want to look it over again after I've had my
> > coffee since I'm still shaking the cobwebs out.  Would the idea here be that if
> > you are doing varint encoding for example that you always allocate the
> > buffer based on ESTIMATE (also taking into account the encoding overhead),
> > but typically expect a much smaller encoding?
> > 
> > As it is, it's very clever.
> > 
> > Mark
> > 
> > >
> > >
> > > #include <iostream>
> > > #include <fstream>
> > > #include <set>
> > > #include <string>
> > > #include <string.h>
> > >
> > > /*******************************************************
> > >
> > >
> > >    New fast encode/decode framework.
> > >
> > >    The entire framework is built around the idea that each object has three
> > operations:
> > >
> > >      ESTIMATE  -- worst-case estimate of the amount of storage required for
> > this object
> > >      ENCODE    -- encode object into buffer of size ESTIMATE
> > >      DECODE    -- decode object from buffer of size actual.
> > >
> > >    Each object has a single templated function that actually provides all three
> > operations in a single set of code.
> > >    By doing this, it's pretty much guaranteed that the ESTIMATE and the
> > ENCODE code are in harmony (i.e. that the estimate is correct)
> > >    it also saves a lot of typing/reading...
> > >
> > >    Generally, all three operations are provided on a single function name
> > with the input and return parameters overloaded to distinguish them.
> > >
> > >    It's observed that for each of the three operations there is a single value
> > which needs to be transmitted between each of the micro-encode/decode
> > calls
> > >    Yes, this is confusing, but let's look at a simple example
> > >
> > >     struct simple {
> > >       int a;
> > >       float b;
> > >       string c;
> > >       set<int> d;
> > >     };
> > >
> > >     To encode this struct we generate a function that does the micro-
> > encoding of each of the fields of the struct
> > >     Here's an example of a function that does the ESTIMATE operation.
> > >
> > >     size_t simple::estimate() {
> > >        return
> > >           sizeof(a) +
> > >           sizeof(b) +
> > >           c.size() +
> > >           d.size() * sizeof(int);
> > >     }
> > >
> > >     We're going to re-write it as:
> > >
> > >     size_t simple::estimate(size_t p) {
> > >        p = estimate(p,a);
> > >        p = estimate(p,b);
> > >        p = estimate(p,c);
> > >        p = estimate(p,d);
> > >        return p;
> > >     }
> > >
> > >     assuming that the sorta function:
> > >
> > >     template<typename t> size_t estimate(size_t p,t& o) { return p +
> > sizeof(o); }
> > >     template<typename t> size_t estimate(size_t p,set<t>& o) { return
> > > p + o.size() * sizeof(t); }
> > >
> > >
> > >     similarly, the encode operation is represented as:
> > >
> > >     char * simple::encode(char *p) {
> > >        p = encode(p,a);
> > >        p = encode(p,b);
> > >        p = encode(p,c);
> > >        p = encode(p,d);
> > >        return p;
> > >     }
> > >
> > >     similarly, the decode operation is represented as:
> > >
> > >     const char * simple::decode(const char *p) {
> > >        p = decode(p,a);
> > >        p = decode(p,b);
> > >        p = decode(p,c);
> > >        p = decode(p,d);
> > >        return p;
> > >     }
> > >
> > >
> > > You can now see that it's possible to create a single function that
> > > does all three operations in a single block of code, provided that you can
> > fiddle the input/output parameter types appropriately.
> > >
> > > In essence the pattern is
> > >
> > >     p = enc_dec(p,struct_field_1);
> > >     p = enc_dec(p,struct_field_2);
> > >     p = enc_dec(p,struct_field_3);
> > >
> > > With the type of p being set differently for each operation, i.e.,
> > >     for ESTIMATE, p = size_t
> > >     for ENCODE,   p = char *
> > >     for DECODE,   p = const char *
> > >
> > > This is the essence of how the encode/decode framework operates.
> > Though there is some more sophistication...
> > >
> > > ----------------------
> > >
> > > We also want to allow the encode/decode machinery to be per-type and
> > > to operate
> > >
> > >
> > **********************************************************
> > ************
> > > *******/
> > >
> > > using namespace std;
> > >
> > > //
> > > // Just like the existing encode/decode machinery. The environment
> > > provides a rich set of // pre-defined encodes for primitive types and
> > > containers //
> > >
> > > #define DEFINE_ENC_DEC_RAW(type) \
> > > inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> > > inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p +
> > sizeof(type); } \
> > > inline const char *enc_dec(const char *p,type &o) { o = *(const type
> > > *)p; return p + sizeof(type); }
> > >
> > > DEFINE_ENC_DEC_RAW(int);
> > > DEFINE_ENC_DEC_RAW(size_t);
> > >
> > > //
> > > // String encode/decode (Yea, I know size_t isn't portable -- this is
> > > // an EXAMPLE man...)
> > > //
> > > inline size_t enc_dec(size_t p,string& s) {
> > >    return p + sizeof(size_t) + s.size();
> > > }
> > > inline char * enc_dec(char *p,string& s) {
> > >    *(size_t *)p = s.size();
> > >    memcpy(p+sizeof(size_t),s.c_str(),s.size());
> > >    return p + sizeof(size_t) + s.size();
> > > }
> > > inline const char *enc_dec(const char *p,string& s) {
> > >    s = string(p + sizeof(size_t),*(size_t *)p);
> > >    return p + sizeof(size_t) + s.size();
> > > }
> > >
> > > //
> > > // Let's do a container.
> > > //
> > > // One of the problems with a container is that making an accurate
> > > estimate of the size // would theoretically require that you walk the entire
> > container and add up the sizes of each element.
> > > // We probably don't want to do that. So here, I do a hack that just
> > > assumes that I can fake up a individual element // and multiple that
> > > by the number of elements in a container. This hack works anytime that
> > > the estimate function // for the contained type has a fixed maximum size.
> > BTW, this is safe, if the contained type has a variable size //  (like set<string>)
> > then it will fault out the first time you run it.
> > > //
> > > // Naturally, something like set<string> or map<string,string> is a
> > > highly desirable thing to be able to encode/decode // there's no reason
> > that you can't create a enc_dec_slow function that properly computes the
> > maximum size by walking the container.
> > > //
> > > template<typename t>
> > > inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t)
> > > + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
> > >
> > > template<typename t>
> > > inline char *enc_dec(char *p,set<t>& s) {
> > >    size_t sz = s.size();
> > >    p = enc_dec(p,sz);
> > >    for (const t& e : s) {
> > >       p = enc_dec(p,const_cast<t&>(e));
> > >    }
> > >    return p;
> > > }
> > >
> > > template<typename t>
> > > inline const char *enc_dec(const char *p,set<t>&s) {
> > >    size_t sz;
> > >    p = enc_dec(p,sz);
> > >    while (sz--) {
> > >       t temp;
> > >       p = enc_dec(p,temp);
> > >       s.insert(temp);
> > >    }
> > >    return p;
> > > }
> > >
> > > //
> > > // Specialized encode/decode for a single data type. These are invoked
> > explicitly...
> > > //
> > > inline size_t enc_dec_lba(size_t p,int& lba) {
> > >    return p + sizeof(lba); // Max....
> > > }
> > >
> > > inline char * enc_dec_lba(char *p,int& lba) {
> > >    *p = 15;
> > >    return p + 1; // blah blah
> > > }
> > >
> > > inline const char *enc_dec_lba(const char *p,int& lba) {
> > >    lba = *p;
> > >    return p+1;
> > > }
> > >
> > > //
> > > // Specialized encode/decode for more sophisticated things primitives.
> > > //
> > > // Here's an example of a encode/decoder for a pair of fields //
> > > inline size_t enc_dec_range(size_t p,short& start,short& end) {
> > >    return p + 2 * sizeof(short);
> > > }
> > >
> > > inline char *enc_dec_range(char *p, short& start, short& end) {
> > >    short *s = (short *) p;
> > >    s[0] = start;
> > >    s[1] = end;
> > >    return p + sizeof(short) * 2;
> > > }
> > >
> > > inline const char *enc_dec_range(const char *p,short& start, short& end) {
> > >    start = *(short *)p;
> > >    end   = *(short *)(p + sizeof(short));
> > >    return p + 2*sizeof(short);
> > > }
> > >
> > >
> > > //
> > > // Some C++ template wizardry to make the single encode/decode
> > function possible.
> > > //
> > > enum SERIAL_TYPE {
> > >    ESTIMATE,
> > >    ENCODE,
> > >    DECODE
> > > };
> > >
> > > template <enum SERIAL_TYPE s> struct serial_type;
> > >
> > > template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> > > template<> struct serial_type<ENCODE>   { typedef char * type; };
> > > template<> struct serial_type<DECODE>   { typedef const char *type; };
> > >
> > > //
> > > // This macro is the key, it connects the external non-member function to
> > the correct member function.
> > > //
> > > #define DEFINE_STRUCT_ENC_DEC(s) \
> > > inline size_t      enc_dec(size_t p, s &o) { return o.enc_dec<ESTIMATE>(p); }
> > \
> > > inline char *      enc_dec(char *p , s &o)  { return o.enc_dec<ENCODE>(p); }
> > \
> > > inline const char *enc_dec(const char *p,s &o)  { return
> > > o.enc_dec<DECODE>(p); }
> > >
> > > //
> > > // Our example structure
> > > //
> > > struct astruct {
> > >    int a;
> > >    set<int> b;
> > >    int lba;
> > >    short start,end;
> > >
> > >    //
> > >    // <<<<< You need to provide this function just once.
> > >    //
> > >    template<enum SERIAL_TYPE s> typename serial_type<s>::type
> > enc_dec(typename serial_type<s>::type p) {
> > >       p = ::enc_dec(p,a);
> > >       p = ::enc_dec(p,b);
> > >       p = ::enc_dec_lba(p,lba);
> > >       p = ::enc_dec_range(p,start,end);
> > >       return p;
> > >    }
> > > };
> > >
> > > //
> > > // This macro connects the global enc_dec to the member function.
> > > // One of these per struct declaration //
> > > DEFINE_STRUCT_ENC_DEC(astruct);
> > >
> > >
> > > //
> > > // Here's a simple test program. The real encode/decode framework
> > > needs to be connected to bufferlist using the pseudo-code // that I
> > documented in my previous email.
> > > //
> > >
> > > int main(int argc,char **argv) {
> > >
> > >    astruct a;
> > >    a.a = 10;
> > >    a.b.insert(2);
> > >    a.b.insert(3);
> > >    a.lba = 12;
> > >
> > >    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> > >    cout << "Estimated size is " << s << "\n";
> > >
> > >    char buffer[100];
> > >
> > >    char *end = a.enc_dec<ENCODE>(buffer);
> > >
> > >    cout << "Actual storage was " << end-buffer << "\n";
> > >
> > >    astruct b;
> > >
> > >    (void) b.enc_dec<DECODE>(buffer); // decode it
> > >
> > >    cout << "A.a = " << b.a << "\n";
> > >    for (auto e : b.b) {
> > >       cout << " " << e;
> > >    }
> > >
> > >    cout << "\n";
> > >
> > >    cout << "a.lba = " << b.lba << "\n";
> > >
> > >    return 0;
> > > }
> > >
> > >
> > > Allen Samuels
> > > SanDisk |a Western Digital brand
> > > 2880 Junction Avenue, San Jose, CA 95134
> > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > >
> > >
> > >> -----Original Message-----
> > >> From: Mark Nelson [mailto:mnelson@redhat.com]
> > >> Sent: Tuesday, July 12, 2016 8:13 PM
> > >> To: Sage Weil <sweil@redhat.com>; Allen Samuels
> > >> <Allen.Samuels@sandisk.com>
> > >> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > >> Subject: Re: bluestore onode diet and encoding overhead
> > >>
> > >>
> > >>
> > >> On 07/12/2016 08:50 PM, Sage Weil wrote:
> > >>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> > >>>> Good analysis.
> > >>>>
> > >>>> My original comments about putting the oNode on a diet included the
> > >>>> idea of a "custom" encode/decode path for certain high-usage cases.
> > >>>> At the time, Sage resisted going down that path hoping that a more
> > >>>> optimized generic case would get the job done. Your analysis shows
> > >>>> that while we've achieved significant space reduction this has come
> > >>>> at the expense of CPU time -- which dominates small object
> > >>>> performance (I suspect that eventually we'd discover that the
> > >>>> variable length decode path would be responsible for a substantial
> > >>>> read performance degradation also -- which may or may not be part of
> > > >>>> the read performance drop-off that you're seeing). This isn't a
> > > >>>> surprising result, though it is unfortunate.
> > >>>>
> > >>>> I believe we need to revisit the idea of custom encode/decode paths
> > >>>> for high-usage cases, only now the gains need to be focused on CPU
> > >>>> utilization as well as space efficiency.
> > >>>
> > > >>> I still think we can get most or all of the way there in a generic way
> > > >>> by revising the way that we interact with bufferlist for encode and
> > > >>> decode.  We haven't actually tried to optimize this yet, and the current
> > > >>> code is pretty horribly inefficient (asserts all over the place, and many
> > > >>> layers of pointer indirection to do a simple append).  I think we need
> > > >>> to do two things:
> > >>>
> > >>> 1) decode path: optimize the iterator class so that it has a const
> > >>> char *current and const char *current_end that point into the current
> > > >>> buffer::ptr.  This way any decode will have a single pointer
> > > >>> add+comparison to ensure there is enough data to copy before falling
> > > >>> into the slow path (partial buffer, move to next buffer, etc.).
> > >>>
> > >>
> > > >> I don't have a good sense yet for how much this is hurting us in the read
> > > >> path.  We screwed something up in the last couple of weeks and small
> > > >> reads are quite slow.
> > >>
> > > >>> 2) Having that comparison is still not ideal, but we should consider
> > > >>> ways to get around that too.  For example, if we know that we are
> > > >>> going to decode N M-byte things, we could do an iterator 'reserve' or
> > > >>> 'check' that ensures we have a valid pointer for that much and then
> > > >>> proceed without checks.  The interface here would be tricky, though,
> > > >>> since in the slow case we'll span buffers and need to magically fall
> > > >>> back to a different decode path (hard to maintain) or do a temporary
> > > >>> copy (probably faster but we need to ensure the iterator owns it and
> > > >>> frees it later).  I'd say this is step 2 and optional; step 1 will have
> > > >>> the most benefit.
> > >>>
> > > >>> 3) encode path: currently all encode methods take a bufferlist& and
> > > >>> use the bufferlist itself as an append buffer.  I think this is flawed
> > > >>> and limiting.  Instead, we should make a new class called
> > > >>> buffer::list::appender (or similar) and templatize the encode methods
> > > >>> so they can take a safe_appender (which does bounds checking) or an
> > > >>> unsafe_appender (which does not).  For the latter, the user takes
> > > >>> responsibility for making sure there is enough space by doing a
> > > >>> reserve() type call which returns an unsafe_appender, and it's their
> > > >>> job to make sure they don't shove too much data into it.  That should
> > > >>> make the encode path a memcpy + ptr increment (for savvy/optimized
> > > >>> callers).
> > >>
> > > >> Seems reasonable and similar in performance to what Piotr and I were
> > > >> discussing this morning.  As a very simple test I was thinking of doing a
> > > >> quick size computation and then passing that in to increase the
> > > >> append_buffer size when the bufferlist is created in
> > > >> Bluestore::_txc_write_nodes.  His idea went a bit farther to break the
> > > >> encapsulation, compute the fully encoded message, and dump it directly
> > > >> into a buffer of a computed size without the extra assert checks or
> > > >> bounds checking.  Obviously his idea would be faster but more work.
> > >>
> > >> It sounds like your solution would be similar but a bit more formalized.
> > >>
> > >>>
> > > >>> I suggest we use bluestore as a test case to make the interfaces work
> > > >>> and be fast.  If we succeed we can take advantage of it across the
> > > >>> rest of the code base as well.
> > >>
> > >> Do we have other places in the code with similar byte append behavior?
> > >> That's what's really killing us I think, especially with how small the new
> > >> append_buffer is when you run out of space when appending bytes.
> > >>
> > >>>
> > >>> That's my thinking, at least.  I haven't had time to prototype it out
> > >>> yet, but I think our goal should be to make the encode/decode paths
> > >>> capable of being a memcpy + ptr addition in the fast path, and let
> > >>> that guide the interface...
> > >>>
> > >>> sage
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >>> in the body of a message to majordomo@vger.kernel.org More
> > >> majordomo
> > >>> info at  http://vger.kernel.org/majordomo-info.html
> > >>>
> 
> 


* RE: bluestore onode diet and encoding overhead
  2016-08-12 16:18             ` Sage Weil
@ 2016-08-12 22:25               ` Allen Samuels
  2016-08-13 21:36                 ` Sage Weil
  0 siblings, 1 reply; 39+ messages in thread
From: Allen Samuels @ 2016-08-12 22:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, August 12, 2016 9:18 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: bluestore onode diet and encoding overhead
> 
> Okay, I finally had some time to read through this thread!
> 
> On Thu, 14 Jul 2016, Allen Samuels wrote:
> > Yes, I did actually run the code before I posted it.
> >
> > w.r.t. varint encoding. You have two choices w.r.t. a variable-length
> > encoding: you could examine the data to accurately predict the output
> > size OR you could just return a constant that represents the
> > worst-case (max) size.  For individual fields, it probably doesn't matter
> > which you choose, but for fields that are part of something in a container,
> > you probably want the option of NOT running down the container to size
> > up each element -- so you'd just choose the worst-case size for the
> > estimator.
> >
> > Though this code doesn't show it, I wrote some pseudo-code in a
> > previous e-mail that glues this framework into the bufferlist stuff.
> > That pseudo code is well prepared for estimate functions that are too
> > large (indeed, it expects that to happen) and it naturally handles
> > buffer overrun detection.
> >
> > I didn't describe it in the example, but this framework very naturally
> > handles versioning, you just add some code like:
> >
> > struct abc {
> >    int version;
> >    int a;
> >    int b;
> >    ..... enc_dec(p) {
> >       ::enc_dec(p, version);
> >       ::enc_dec(p, a);
> >       // This field is present in all estimate and encode operations,
> >       // but only in decode operations when version is > 5
> >       if (s != DECODE || version > 5) ::enc_dec(p, b);
> >    }
> > };
> 
> This is pretty cool.  I'm mainly nervous about the versioning stuff.  The
> current encode/decode scheme already has a length (so we're good there),
> and also two other fields: struct_v and compat_v, indicating the version of
> the encoding and the oldest decoder that can understand it.  I don't see any
> reason why that couldn't be replicated here.  I am a bit nervous about the
> decoding side, though, since it can get complicated.  What you have above is
> the common case, but even a moderately simple one (where we didn't have
> to do anything kludgey) looks like this:
> 
> 	https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc#L2337

Yes, the struct_v and compat_v stuff can be easily handled and I don't think the code is any more "complicated" than what's done for the object that you showed. In fact the code is almost identical....


> 
> I suspect those conditionals would end up looking like
> 
>  if (s != DECODE || version > 5) {
>    ::enc_dec(p, b);
>  } else if (s == DECODE) {
>    b = default/compat value;
>  }

Almost, I think it's simpler. Let's assume that the case we care about is a field that's present only in versions > 5. Then the code is: 

   if (version > 5) {
      ::enc_dec(p,b);
   }
     ....default....

Which is pretty much what it looks like in the current code...
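
To make the version-gated case concrete, here is a minimal sketch of how it could look inside the single templated function. It is an illustration under assumptions, not code from the prototype: CURRENT_VERSION, the default value of -1, and the use of memcpy for the int overloads are all hypothetical choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

enum SERIAL_TYPE { ESTIMATE, ENCODE, DECODE };

template <SERIAL_TYPE s> struct serial_type;
template <> struct serial_type<ESTIMATE> { typedef size_t type; };
template <> struct serial_type<ENCODE>   { typedef char *type; };
template <> struct serial_type<DECODE>   { typedef const char *type; };

// One overload per operation for a primitive; memcpy avoids unaligned access.
inline size_t enc_dec(size_t p, int &) { return p + sizeof(int); }
inline char *enc_dec(char *p, int &o) {
  std::memcpy(p, &o, sizeof o);
  return p + sizeof o;
}
inline const char *enc_dec(const char *p, int &o) {
  std::memcpy(&o, p, sizeof o);
  return p + sizeof o;
}

struct abc {
  static constexpr int CURRENT_VERSION = 6;  // hypothetical
  int version = CURRENT_VERSION;
  int a = 0;
  int b = -1;  // hypothetical default/compat value

  template <SERIAL_TYPE s>
  typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
    p = ::enc_dec(p, version);  // on DECODE this fills in the stored version
    p = ::enc_dec(p, a);
    if (version > 5) {
      p = ::enc_dec(p, b);      // field only exists in versions > 5
    } else {
      b = -1;                   // ....default....
    }
    return p;
  }
};
```

Because the same branch drives all three operations, the estimate for an old-version object shrinks in step with what encode actually writes.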


> 
> I'm still trying to sort out in my head how this relates to the appender thing.  I
> think they're largely orthogonal, but the estimate function here could be
> used to drive the unsafe_appender stuff pretty seamlessly.
> Using the unsafe_appender manually is going to be a lot more error-prone,
> but should get the same performance benefit, without unifying the
> encode/decode stuff.

Yes, they're conceptually orthogonal. In all cases, we make a CONSERVATIVE estimate of the space that's required, acquire that space, and then encode without checking for overflow. At the end of the encode you can give back the unused space.
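
A rough sketch of that reserve / encode-unchecked / give-back cycle follows. It uses a std::vector<char> as a stand-in for bufferlist, so nothing here is Ceph's API; the record type, its one-or-five-byte length prefix, and append_record are all hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record with a variable-length field.
struct record {
  uint32_t id;
  std::string name;

  // Conservative bound: assumes the length prefix always takes 5 bytes.
  size_t estimate() const {
    return sizeof(id) + 1 + sizeof(uint32_t) + name.size();
  }

  // Unchecked encode: the caller guarantees estimate() bytes exist at p.
  char *encode(char *p) const {
    std::memcpy(p, &id, sizeof id);
    p += sizeof id;
    uint32_t len = static_cast<uint32_t>(name.size());
    if (len < 0xFF) {
      *p++ = static_cast<char>(len);   // short form: 1 byte
    } else {
      *p++ = static_cast<char>(0xFF);  // long form: marker + 4 bytes
      std::memcpy(p, &len, sizeof len);
      p += sizeof len;
    }
    std::memcpy(p, name.data(), len);
    return p + len;
  }
};

// Reserve the worst case once, encode with no per-field checks,
// then return whatever part of the reservation went unused.
inline void append_record(std::vector<char> &out, const record &r) {
  const size_t old_size = out.size();
  out.resize(old_size + r.estimate());
  char *start = out.data() + old_size;
  char *end = r.encode(start);
  const size_t used = static_cast<size_t>(end - start);
  assert(used <= r.estimate());  // the estimate must never be too small
  out.resize(old_size + used);
}
```

The only size check in the whole path is the single resize before encoding; the trailing resize is where the unused space is handed back.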

> 
> I'm a bit worried that the estimate process will be too slow, though.  On a
> complicated nested object, for example, it will have to traverse the full data
> structure once to estimate, and then again to encode.  It might be simpler
> and faster to have the outer parts of the structure operate on a
> safe_encoder, and construct an unsafe_encoder only when we are explicitly
> prepared to do the estimate.  For example, we can have a safe_encoder
> method like

These are exactly the right tradeoffs. The framework allows you to specify for particular containers whether you need to walk the container or not to make an estimate. The optimization is only important for containers with large numbers of small objects -- in any other case (small numbers, large objects....) the overhead isn't important to optimize out.

BTW, there's nothing in the framework that requires you to do the estimate process once for the whole world, you can absolutely do it piecemeal like you suggest.

IMO, the most important thing about the estimate/encode cycle is that the estimate NEVER be wrong. I was very worried about a design pattern that separated the estimation process from the actual encoding process in a way that would allow this kind of error to creep in. What I'm worried about is an endless series of production run-time failures due to incorrect estimation for low probability data-dependent cases -- in other words, I consider it a requirement of the design that the estimation process be so tightly linked to the encoding process that possibilities of rare overruns are essentially eliminated by construction. So, I ended up with a design pattern that -- by default -- is guaranteed to "get a good enough" answer [i.e., conservative] at the expense of time -- but you can selectively override the default in a way that I think is pretty safe to get that time back when it matters.

BTW, I expect most enc_dec calls for ESTIMATION to compile to just a few multiplies and adds for anything but a container you have to walk. 
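
As a small illustration of the difference, the sketch below contrasts an estimate that needs no traversal with one that has to walk the container. The function names (estimate_one, estimate_fixed, estimate_walk) and the size_t count prefix are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Per-element worst-case sizes for the hypothetical wire format used here.
inline size_t estimate_one(const int &) { return sizeof(int); }
inline size_t estimate_one(const std::string &s) {
  return sizeof(size_t) + s.size();  // length prefix + bytes
}

// Fast estimate: every element has the same fixed upper bound,
// so this is one multiply and one add, with no traversal.
inline size_t estimate_fixed(const std::set<int> &s) {
  return sizeof(size_t) + s.size() * sizeof(int);
}

// Slow estimate: element sizes vary, so walk the container once.
template <typename T>
size_t estimate_walk(const std::set<T> &s) {
  size_t total = sizeof(size_t);  // element-count prefix
  for (const T &e : s) {
    total += estimate_one(e);
  }
  return total;
}
```

For set<int> both functions give the same answer, so the walk is pure overhead; for set<string> only the walking version can produce a bound that is guaranteed to be large enough.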


> 
>   unsafe_encoder reserve(size_t s);
> 
> so that we can do
> 
>   void encode(bufferlist::safe_appender& ap) const {
>     ENCODE_START(2, 2);
>     // do a single range check for all of our simple members
>     {
>       unsafe_encoder t = ap.reserve(5 * sizeof(uint64_t));
>       ::encode(foo, t);
>       ::encode(bar, t);
>       ::encode(baz, t);
>       ::encode(a, t);
>       ::encode(b, t);
>     }
>     // use the safe encoder for some complex ones
>     ::encode(widget, ap);
>     ::encode(widget2, ap);
>     // explicitly estimate a simple container
>     {
>       unsafe_encoder t = ap.reserve(sub.size() * known_worst_case);
>       ::encode(sub, t);
>     }
>     // dynamically range check a complex container
>     ::encode(complex_container, ap);
>   }
> 
> The enc_dec currently forces a full estimate in all cases, even when it's not
> really needed.  Perhaps we can come up with some set of templates and
> wrapper functions so that we can use safe and unsafe encoders somewhat
> interchangeably so that the estimate infrastructure is only triggered when
> needed?

I think the characterization of "full estimate in all cases, even when it's not really needed" isn't correct. If you code the above using my framework then for all except the two "complex" containers, the estimation is pretty simple, just a few adds and multiplies.

It is true for the "complex" containers that you'll walk the container twice, once for the estimate and once for the encode. However, that's the only difference. The actual cost of encoding the elements of the two containers is the same (i.e., a reserve of the size followed by the encoding of the container itself); the only delta is the overhead of the for_each loop itself -- the cost of walking the container. I believe that if the container is "complex", then the cost of walking it, when compared to the cost of checking the buffer size and doing the per-element encode, is minimal. In other words, I disagree about the cost of the estimate phase (you're doing the same work just re-distributed, I only walk the container itself once extra).


 
> 
> sage
> 
> 
> 
> > > What this framework doesn't yet handle very well is situations where
> > > you have a container with a contained type that is a primitive (i.e.,
> > > uint8) and you want that contained type to be custom encoded.
> > > Currently, the only solution is to replace the contained primitive type
> > > with a class wrapper. Unfortunately a typedef is NOT sufficient to
> > > differentiate it.
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: Mark Nelson [mailto:mnelson@redhat.com]
> > > Sent: Thursday, July 14, 2016 4:16 AM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> > > <sweil@redhat.com>
> > > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > Subject: Re: bluestore onode diet and encoding overhead
> > >
> > > On 07/14/2016 12:52 AM, Allen Samuels wrote:
> > > > > As promised, here's some code that hacks out a new encode/decode
> > > > > framework. That has the advantage of only having to list the fields
> > > > > of a struct once and is pretty much guaranteed to never overrun a
> > > > > buffer....
> > > >
> > > > Comments are requested :)
> > >
> > > It compiles! :D
> > >
> > > > I looked over the code, but I want to look it over again after I've
> > > > had my coffee since I'm still shaking the cobwebs out.  Would the
> > > > idea here be that if you are doing varint encoding for example that
> > > > you always allocate the buffer based on ESTIMATE (also taking into
> > > > account the encoding overhead), but typically expect a much smaller
> > > > encoding?
> > >
> > > As it is, it's very clever.
> > >
> > > Mark
> > >
> > > >
> > > >
> > > > #include <iostream>
> > > > #include <fstream>
> > > > #include <set>
> > > > #include <string>
> > > > #include <string.h>
> > > >
> > > >
> > > > > /*******************************************************
> > > > >
> > > > >    New fast encode/decode framework.
> > > > >
> > > > >    The entire framework is built around the idea that each object has three operations:
> > > > >
> > > > >      ESTIMATE  -- worst-case estimate of the amount of storage required for this object
> > > > >      ENCODE    -- encode object into buffer of size ESTIMATE
> > > > >      DECODE    -- decode object from buffer of actual size.
> > > > >
> > > > >    Each object has a single templated function that actually provides all three operations in a single set of code.
> > > > >    By doing this, it's pretty much guaranteed that the ESTIMATE and the ENCODE code are in harmony (i.e. that the estimate is correct)
> > > > >    it also saves a lot of typing/reading...
> > > > >
> > > > >    Generally, all three operations are provided on a single function name with the input and return parameters overloaded to distinguish them.
> > > > >
> > > > >    It's observed that for each of the three operations there is a single value which needs to be transmitted between each of the micro-encode/decode calls
> > > > >    Yes, this is confusing, but let's look at a simple example
> > > >
> > > >     struct simple {
> > > >       int a;
> > > >       float b;
> > > >       string c;
> > > >       set<int> d;
> > > >     };
> > > >
> > > > >     To encode this struct we generate a function that does the
> > > > >     micro-encoding of each of the fields of the struct
> > > >     Here's an example of a function that does the ESTIMATE operation.
> > > >
> > > >     size_t simple::estimate() {
> > > >        return
> > > >           sizeof(a) +
> > > >           sizeof(b) +
> > > >           c.size() +
> > > >           d.size() * sizeof(int);
> > > >     }
> > > >
> > > >     We're going to re-write it as:
> > > >
> > > >     size_t simple::estimate(size_t p) {
> > > >        p = estimate(p,a);
> > > >        p = estimate(p,b);
> > > >        p = estimate(p,c);
> > > >        p = estimate(p,d);
> > > >        return p;
> > > >     }
> > > >
> > > > >     assuming functions along these lines:
> > > > >
> > > > >     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
> > > > >     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
> > > >
> > > >
> > > >     similarly, the encode operation is represented as:
> > > >
> > > >     char * simple::encode(char *p) {
> > > >        p = encode(p,a);
> > > >        p = encode(p,b);
> > > >        p = encode(p,c);
> > > >        p = encode(p,d);
> > > >        return p;
> > > >     }
> > > >
> > > >     similarly, the decode operation is represented as:
> > > >
> > > >     const char * simple::decode(const char *p) {
> > > >        p = decode(p,a);
> > > >        p = decode(p,b);
> > > >        p = decode(p,c);
> > > >        p = decode(p,d);
> > > >        return p;
> > > >     }
> > > >
> > > >
> > > > > You can now see that it's possible to create a single function
> > > > > that does all three operations in a single block of code, provided
> > > > > that you can fiddle the input/output parameter types appropriately.
> > > >
> > > > In essence the pattern is
> > > >
> > > >     p = enc_dec(p,struct_field_1);
> > > >     p = enc_dec(p,struct_field_2);
> > > >     p = enc_dec(p,struct_field_3);
> > > >
> > > > With the type of p being set differently for each operation, i.e.,
> > > >     for ESTIMATE, p = size_t
> > > >     for ENCODE,   p = char *
> > > >     for DECODE,   p = const char *
> > > >
> > > > > This is the essence of how the encode/decode framework operates.
> > > > > Though there is some more sophistication...
> > > >
> > > > ----------------------
> > > >
> > > > We also want to allow the encode/decode machinery to be per-type
> > > > and to operate
> > > >
> > > >
> > >
> **********************************************************
> > > ************
> > > > *******/
> > > >
> > > > using namespace std;
> > > >
> > > > //
> > > > > // Just like the existing encode/decode machinery. The environment provides a rich set of
> > > > > // pre-defined encodes for primitive types and containers
> > > > > //
> > > >
> > > > #define DEFINE_ENC_DEC_RAW(type) \
> > > > inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> > > > > inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
> > > > > inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }
> > > >
> > > > DEFINE_ENC_DEC_RAW(int);
> > > > DEFINE_ENC_DEC_RAW(size_t);
> > > >
> > > > //
> > > > > // String encode/decode (Yea, I know size_t isn't portable -- this is an EXAMPLE man...)
> > > > > //
> > > > > inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
> > > > > inline char * enc_dec(char * p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
> > > > > inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); }
> > > >
> > > > //
> > > > // Let's do a container.
> > > > //
> > > > > // One of the problems with a container is that making an accurate estimate of the size
> > > > > // would theoretically require that you walk the entire container and add up the sizes of each element.
> > > > > // We probably don't want to do that. So here, I do a hack that just assumes that I can fake up an individual element
> > > > > // and multiply that by the number of elements in a container. This hack works anytime that the estimate function
> > > > > // for the contained type has a fixed maximum size. BTW, this is safe, if the contained type has a variable size
> > > > > // (like set<string>) then it will fault out the first time you run it.
> > > > > //
> > > > > // Naturally, something like set<string> or map<string,string> is a highly desirable thing to be able to encode/decode
> > > > > // there's no reason that you can't create a enc_dec_slow function that properly computes the maximum size by walking the container.
> > > > //
> > > > template<typename t>
> > > > > inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
> > > >
> > > > template<typename t>
> > > > inline char *enc_dec(char *p,set<t>& s) {
> > > >    size_t sz = s.size();
> > > >    p = enc_dec(p,sz);
> > > >    for (const t& e : s) {
> > > >       p = enc_dec(p,const_cast<t&>(e));
> > > >    }
> > > >    return p;
> > > > }
> > > >
> > > > template<typename t>
> > > > inline const char *enc_dec(const char *p,set<t>&s) {
> > > >    size_t sz;
> > > >    p = enc_dec(p,sz);
> > > >    while (sz--) {
> > > >       t temp;
> > > >       p = enc_dec(p,temp);
> > > >       s.insert(temp);
> > > >    }
> > > >    return p;
> > > > }
> > > >
> > > > //
> > > > > // Specialized encode/decode for a single data type. These are invoked explicitly...
> > > > //
> > > > inline size_t enc_dec_lba(size_t p,int& lba) {
> > > >    return p + sizeof(lba); // Max....
> > > > }
> > > >
> > > > inline char * enc_dec_lba(char *p,int& lba) {
> > > >    *p = 15;
> > > >    return p + 1; // blah blah
> > > > }
> > > >
> > > > inline const char *enc_dec_lba(const char *p,int& lba) {
> > > >    lba = *p;
> > > >    return p+1;
> > > > }
> > > >
> > > > //
> > > > > // Specialized encode/decode for more sophisticated primitives.
> > > > > //
> > > > > // Here's an example of an encode/decoder for a pair of fields
> > > > > //
> > > > > inline size_t enc_dec_range(size_t p,short& start,short& end) {
> > > >    return p + 2 * sizeof(short);
> > > > }
> > > >
> > > > inline char *enc_dec_range(char *p, short& start, short& end) {
> > > >    short *s = (short *) p;
> > > >    s[0] = start;
> > > >    s[1] = end;
> > > >    return p + sizeof(short) * 2;
> > > > }
> > > >
> > > > > inline const char *enc_dec_range(const char *p,short& start, short& end) {
> > > >    start = *(short *)p;
> > > >    end   = *(short *)(p + sizeof(short));
> > > >    return p + 2*sizeof(short);
> > > > }
> > > >
> > > >
> > > > //
> > > > > // Some C++ template wizardry to make the single encode/decode function possible.
> > > > //
> > > > enum SERIAL_TYPE {
> > > >    ESTIMATE,
> > > >    ENCODE,
> > > >    DECODE
> > > > };
> > > >
> > > > template <enum SERIAL_TYPE s> struct serial_type;
> > > >
> > > > template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> > > > template<> struct serial_type<ENCODE>   { typedef char * type; };
> > > > template<> struct serial_type<DECODE>   { typedef const char *type; };
> > > >
> > > > //
> > > > > // This macro is the key, it connects the external non-member function to the correct member function.
> > > > > //
> > > > > #define DEFINE_STRUCT_ENC_DEC(s) \
> > > > > inline size_t      enc_dec(size_t p, s &o) { return o.enc_dec<ESTIMATE>(p); } \
> > > > > inline char *      enc_dec(char *p , s &o)  { return o.enc_dec<ENCODE>(p); } \
> > > > > inline const char *enc_dec(const char *p,s &o)  { return o.enc_dec<DECODE>(p); }
> > > >
> > > > //
> > > > // Our example structure
> > > > //
> > > > struct astruct {
> > > >    int a;
> > > >    set<int> b;
> > > >    int lba;
> > > >    short start,end;
> > > >
> > > >    //
> > > > >    // <<<<< You need to provide this function just once.
> > > > >    //
> > > > >    template<enum SERIAL_TYPE s> typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
> > > >       p = ::enc_dec(p,a);
> > > >       p = ::enc_dec(p,b);
> > > >       p = ::enc_dec_lba(p,lba);
> > > >       p = ::enc_dec_range(p,start,end);
> > > >       return p;
> > > >    }
> > > > };
> > > >
> > > > //
> > > > > // This macro connects the global enc_dec to the member function.
> > > > > // One of these per struct declaration
> > > > > //
> > > > > DEFINE_STRUCT_ENC_DEC(astruct);
> > > >
> > > >
> > > > //
> > > > > // Here's a simple test program. The real encode/decode framework needs to be connected to bufferlist using the pseudo-code
> > > > > // that I documented in my previous email.
> > > > //
> > > >
> > > > int main(int argc,char **argv) {
> > > >
> > > >    astruct a;
> > > >    a.a = 10;
> > > >    a.b.insert(2);
> > > >    a.b.insert(3);
> > > >    a.lba = 12;
> > > >
> > > >    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> > > >    cout << "Estimated size is " << s << "\n";
> > > >
> > > >    char buffer[100];
> > > >
> > > >    char *end = a.enc_dec<ENCODE>(buffer);
> > > >
> > > >    cout << "Actual storage was " << end-buffer << "\n";
> > > >
> > > >    astruct b;
> > > >
> > > >    (void) b.enc_dec<DECODE>(buffer); // decode it
> > > >
> > > >    cout << "A.a = " << b.a << "\n";
> > > >    for (auto e : b.b) {
> > > >       cout << " " << e;
> > > >    }
> > > >
> > > >    cout << "\n";
> > > >
> > > >    cout << "a.lba = " << b.lba << "\n";
> > > >
> > > >    return 0;
> > > > }
> > > >
> > > >
> > > > Allen Samuels
> > > > SanDisk |a Western Digital brand
> > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Mark Nelson [mailto:mnelson@redhat.com]
> > > >> Sent: Tuesday, July 12, 2016 8:13 PM
> > > >> To: Sage Weil <sweil@redhat.com>; Allen Samuels
> > > >> <Allen.Samuels@sandisk.com>
> > > >> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > >> Subject: Re: bluestore onode diet and encoding overhead
> > > >>
> > > >>
> > > >>
> > > >> On 07/12/2016 08:50 PM, Sage Weil wrote:
> > > >>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> > > >>>> Good analysis.
> > > >>>>
> > > >>>> My original comments about putting the oNode on a diet included
> > > >>>> the idea of a "custom" encode/decode path for certain high-usage
> cases.
> > > >>>> At the time, Sage resisted going down that path hoping that a
> > > >>>> more optimized generic case would get the job done. Your
> > > >>>> analysis shows that while we've achieved significant space
> > > >>>> reduction this has come at the expense of CPU time -- which
> > > >>>> dominates small object performance (I suspect that eventually
> > > >>>> we'd discover that the variable length decode path would be
> > > >>>> responsible for a substantial read performance degradation also
> > > >>>> -- which may or may not be part of the read performance
> > > > >>>> drop-off that you're seeing). This isn't a surprising result,
> > > > >>>> though it is unfortunate.
> > > >>>>
> > > >>>> I believe we need to revisit the idea of custom encode/decode
> > > >>>> paths for high-usage cases, only now the gains need to be
> > > >>>> focused on CPU utilization as well as space efficiency.
> > > >>>
> > > > >>> I still think we can get most or all of the way there in a
> > > > >>> generic way by revising the way that we interact with bufferlist
> > > > >>> for encode and decode.  We haven't actually tried to optimize this
> > > > >>> yet, and the current code is pretty horribly inefficient (asserts
> > > > >>> all over the place, and many layers of pointer indirection to do a
> > > > >>> simple append).  I think we need to do two things:
> > > >>>
> > > > >>> 1) decode path: optimize the iterator class so that it has a
> > > > >>> const char *current and const char *current_end that point into
> > > > >>> the current buffer::ptr.  This way any decode will have a single
> > > > >>> pointer add+comparison to ensure there is enough data to copy
> > > > >>> before falling into the slow path (partial buffer, move to next
> > > > >>> buffer, etc.).
> > > >>>
> > > >>
> > > > >> I don't have a good sense yet for how much this is hurting us in
> > > > >> the read path.  We screwed something up in the last couple of
> > > > >> weeks and small reads are quite slow.
> > > >>
> > > > >>> 2) Having that comparison is still not ideal, but we should
> > > > >>> consider ways to get around that too.  For example, if we know
> > > > >>> that we are going to decode N M-byte things, we could do an
> > > > >>> iterator 'reserve' or 'check' that ensures we have a valid
> > > > >>> pointer for that much and then proceed without checks.  The
> > > > >>> interface here would be tricky, though, since in the slow case
> > > > >>> we'll span buffers and need to magically fall back to a
> > > > >>> different decode path (hard to maintain) or do a temporary copy
> > > > >>> (probably faster but we need to ensure the iterator owns it and
> > > > >>> frees it later).  I'd say this is step 2 and optional; step 1
> > > > >>> will have the most benefit.
> > > >>>
> > > > >>> 3) encode path: currently all encode methods take a bufferlist&
> > > > >>> and use the bufferlist itself as an append buffer.  I think this
> > > > >>> is flawed and limiting.  Instead, we should make a new class called
> > > > >>> buffer::list::appender (or similar) and templatize the encode
> > > > >>> methods so they can take a safe_appender (which does bounds
> > > > >>> checking) or an unsafe_appender (which does not).  For the
> > > > >>> latter, the user takes responsibility for making sure there is
> > > > >>> enough space by doing a reserve() type call which returns an
> > > > >>> unsafe_appender, and it's their job to make sure they don't shove
> > > > >>> too much data into it.  That should make the encode path a
> > > > >>> memcpy + ptr increment (for savvy/optimized callers).
> > > >>
> > > > >> Seems reasonable and similar in performance to what Piotr and I
> > > > >> were discussing this morning.  As a very simple test I was
> > > > >> thinking of doing a quick size computation and then passing that
> > > > >> in to increase the append_buffer size when the bufferlist is
> > > > >> created in Bluestore::_txc_write_nodes.  His idea went a bit
> > > > >> farther to break the encapsulation, compute the fully encoded
> > > > >> message, and dump it directly into a buffer of a computed size
> > > > >> without the extra assert checks or bounds checking.  Obviously his
> > > > >> idea would be faster but more work.
> > > >>
> > > >> It sounds like your solution would be similar but a bit more formalized.
> > > >>
> > > >>>
> > > > >>> I suggest we use bluestore as a test case to make the interfaces
> > > > >>> work and be fast.  If we succeed we can take advantage of it
> > > > >>> across the rest of the code base as well.
> > > >>
> > > > >> Do we have other places in the code with similar byte append
> > > > >> behavior?  That's what's really killing us I think, especially with
> > > > >> how small the new append_buffer is when you run out of space when
> > > > >> appending bytes.
> > > >>
> > > >>>
> > > >>> That's my thinking, at least.  I haven't had time to prototype
> > > >>> it out yet, but I think our goal should be to make the
> > > >>> encode/decode paths capable of being a memcpy + ptr addition in
> > > >>> the fast path, and let that guide the interface...
> > > >>>
> > > >>> sage
> > > >>> --
> > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > >>> in the body of a message to majordomo@vger.kernel.org More
> > > >> majordomo
> > > >>> info at  http://vger.kernel.org/majordomo-info.html
> > > >>>
> >
> >


* RE: bluestore onode diet and encoding overhead
  2016-08-12 22:25               ` Allen Samuels
@ 2016-08-13 21:36                 ` Sage Weil
  2016-08-14 20:37                   ` Allen Samuels
  0 siblings, 1 reply; 39+ messages in thread
From: Sage Weil @ 2016-08-13 21:36 UTC
  To: Allen Samuels; +Cc: Mark Nelson, ceph-devel

On Fri, 12 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, August 12, 2016 9:18 AM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-
> > devel@vger.kernel.org>
> > Subject: RE: bluestore onode diet and encoding overhead
> > 
> > Okay, I finally had some time to read through this thread!
> > 
> > On Thu, 14 Jul 2016, Allen Samuels wrote:
> > > Yes, I did actually run the code before I posted it.
> > >
> > > w.r.t. varint encoding: you have two choices for a variable-length
> > > encoding. You could examine the data to accurately predict the output
> > > size, OR you could just return a constant that represents the
> > > worst-case (max) size.  For individual fields, it probably doesn't
> > > matter which you choose, but for fields that are part of something in
> > > a container, you probably want the option of NOT running down the
> > > container to size up each element -- so you'd just choose the
> > > worst-case size for the estimator.
> > >
> > > Though this code doesn't show it, I wrote some pseudo-code in a
> > > previous e-mail that glues this framework into the bufferlist stuff.
> > > That pseudo code is well prepared for estimate functions that are too
> > > large (indeed, it expects that to happen) and it naturally handles
> > > buffer overrun detection.
> > >
> > > I didn't describe it in the example, but this framework very naturally
> > > handles versioning, you just add some code like:
> > >
> > > > struct abc {
> > > >    int version;
> > > >    int a;
> > > >    int b;
> > > >    ..... enc_dec(p) {
> > > >       ::enc_dec(p, version);
> > > >       ::enc_dec(p, a);
> > > >       // This field is present in all estimate and encode operations,
> > > >       // but only in decode operations when version is > 5
> > > >       if (s != DECODE || version > 5) ::enc_dec(p, b);
> > > >    }
> > > > };
> > 
> > This is pretty cool.  I'm mainly nervous about the versioning stuff.  The
> > current encode/decode scheme already has a length (so we're good there),
> > and also two other fields: struct_v and compat_v, indicating the version of
> > the encoding and the oldest decoder that can understand it.  I don't see any
> > reason why that couldn't be replicated here.  I am a bit nervous about the
> > decoding side, though, since it can get complicated.  What you have above is
> > the common case, but even a moderately simple one (where we didn't have
> > to do anything kludgey) looks like this:
> > 
> > 	https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc#L2337
> 
> Yes, the struct_v and compat_v stuff can be easily handled and I don't think the code is any more "complicated" than what's done for the object that you showed. In fact the code is almost identical....
> 
> 
> > 
> > I suspect those conditionals would end up looking like
> > 
> >  if (s != DECODE || version > 5) {
> >    ::enc_dec(p, b);
> >  } else if (s == DECODE) {
> >    b = default/compat value;
> >  }
> 
> Almost, I think it's simpler. Let's assume that the case we care about is a field that's present only in versions > 5. Then the code is: 
> 
>    if (version > 5) {
>       ::enc_dec(p,b);
>    } else {
>       b = 1;  // compat value
>    }

If we set the version/struct_v var in the encode and estimate paths to 
the latest version, then yeah.  It may read a bit strange because the 
compat block will only be visited in the decode path (when struct_v 
could be small), but that should be fine.
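
To make that concrete, here is a minimal, self-contained sketch of the pattern. It is not Allen's branch and not the ENCODE_START machinery; the struct name abc, the constant LATEST and the fixed-width int encoding are assumptions for illustration only. The version is forced to the latest value on the estimate and encode paths, so the else branch can only run while decoding older data:

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum SERIAL_TYPE { ESTIMATE, ENCODE, DECODE };

template <SERIAL_TYPE s> struct serial_type;
template <> struct serial_type<ESTIMATE> { typedef size_t type; };
template <> struct serial_type<ENCODE>   { typedef char *type; };
template <> struct serial_type<DECODE>   { typedef const char *type; };

// Fixed-width int: three overloads under one name, selected by the type of p.
inline size_t enc_dec(size_t p, int &) { return p + sizeof(int); }
inline char *enc_dec(char *p, int &o) { memcpy(p, &o, sizeof(int)); return p + sizeof(int); }
inline const char *enc_dec(const char *p, int &o) { memcpy(&o, p, sizeof(int)); return p + sizeof(int); }

struct abc {
  static constexpr int LATEST = 6;  // hypothetical current struct_v
  int version = LATEST;
  int a = 0;
  int b = 0;

  template <SERIAL_TYPE s>
  typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
    if (s != DECODE) version = LATEST;  // estimate and encode always use the latest layout
    p = ::enc_dec(p, version);          // decode overwrites version with the stored value
    p = ::enc_dec(p, a);
    if (version > 5) {
      p = ::enc_dec(p, b);
    } else {
      b = 1;                            // compat value; reachable only when decoding old data
    }
    return p;
  }
};
```

Decoding a buffer that holds version 5 leaves b at the compat value 1, while a buffer produced by the encode path round-trips b unchanged.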

> 
> Which is pretty much what it looks like in the current code...
> 
> 
> > 
> > I'm still trying to sort out in my head how this relates to the appender thing.  I
> > think they're largely orthogonal, but the estimate function here could be
> > used to drive the unsafe_appender stuff pretty seamlessly.
> > Using the unsafe_appender manually is going to be a lot more error-prone,
> > but should get the same performance benefit, without unifying the
> > encode/decode stuff.
> 
> Yes, they're conceptually orthogonal. In all cases, we make a 
> CONSERVATIVE estimate of the space that's required. Acquire that space 
> and then encode without checking for overflow. At the end of the encode 
> you can forgive the unused space.
> 
> > 
> > I'm a bit worried that the estimate process will be too slow, though.  On a
> > complicated nested object, for example, it will have to traverse the full data
> > structure once to estimate, and then again to encode.  It might be simpler
> > and faster to have the outer parts of the structure operate on a
> > safe_encoder, and construct an unsafe_encoder only when we are explicitly
> > prepared to do the estimate.  For example, we can have a safe_encoder
> > method like
> 
> These are exactly the right tradeoffs. The framework allows you to 
> specify for particular containers whether you need to walk the container 
> or not to make an estimate. The optimization is only important for 
> containers with large numbers of small objects -- in any other case 
> (small numbers, large objects....) the overhead isn't important to 
> optimize out.
> 
> BTW, there's nothing in the framework that requires you to do the 
> estimate process once for the whole world, you can absolutely do it 
> piecemeal like you suggest.

I'm having trouble imagining what this code is going to look like.  How 
close is your branch to a point where I can start playing with it?
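
As a rough illustration only of the cycle being discussed (conservative estimate, reserve, encode without checks, give back the unused space), something along these lines is one possibility. toy_appender is a hypothetical stand-in backed by std::vector<char>, not the bufferlist API, and the 10-byte worst case assumes a LEB128-style varint for a 64-bit value:

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical stand-in for a bufferlist appender; not the Ceph API.
class toy_appender {
  std::vector<char> buf_;
  size_t used_ = 0;
public:
  // Make room for n more bytes and return a raw pointer for unchecked writes.
  char *reserve(size_t n) {
    if (buf_.size() < used_ + n) buf_.resize(used_ + n);
    return buf_.data() + used_;
  }
  // Record where the encoder stopped; reserved bytes beyond that are given back.
  void commit(const char *end) {
    used_ = static_cast<size_t>(end - buf_.data());
    assert(used_ <= buf_.size());  // catches (after the fact) an estimate that was too small
  }
  size_t length() const { return used_; }
  const char *data() const { return buf_.data(); }
};

// ESTIMATE: constant worst case, no need to look at the value.
inline size_t enc_dec_varint(size_t p, uint64_t &) { return p + 10; }

// ENCODE: 7 bits per byte, high bit set on every byte except the last.
inline char *enc_dec_varint(char *p, uint64_t &v) {
  uint64_t x = v;
  while (x >= 0x80) { *p++ = static_cast<char>((x & 0x7f) | 0x80); x >>= 7; }
  *p++ = static_cast<char>(x);
  return p;
}

// DECODE: the inverse of the encoder above.
inline const char *enc_dec_varint(const char *p, uint64_t &v) {
  v = 0;
  for (int shift = 0;; shift += 7) {
    unsigned char c = static_cast<unsigned char>(*p++);
    v |= static_cast<uint64_t>(c & 0x7f) << shift;
    if (!(c & 0x80)) break;
  }
  return p;
}

// Estimate just these two fields, reserve once, then encode with no checks.
inline void encode_pair(toy_appender &ap, uint64_t a, uint64_t b) {
  size_t est = enc_dec_varint(enc_dec_varint(size_t(0), a), b);
  char *p = ap.reserve(est);
  p = enc_dec_varint(p, a);
  p = enc_dec_varint(p, b);
  ap.commit(p);
}
```

The estimate here is 20 bytes for the pair, while small values encode to far fewer; the difference is simply handed back at commit time.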
 
> IMO, the most important thing about the estimate/encode cycle is that 
> the estimate NEVER be wrong. I was very worried about a design pattern 
> that separated the estimation process from the actual encoding process 
> in a way that would allow this kind of error to creep in. What I'm 
> worried about is an endless series of production run-time failures due 
> to incorrect estimation for low probability data-dependent cases -- in 
> other words, I consider it a requirement of the design that the 
> estimation process be so tightly linked to the encoding process that 
> possibilities of rare overruns are essentially eliminated by 
> construction. So, I ended up with a design pattern that -- by default -- 
> is guaranteed to "get a good enough" answer [i.e., conservative] at the 
> expense of time -- but you can selectively override the default in a way 
> that I think is pretty safe to get that time back when it matters.

Yeah--I agree here.

sage

> BTW, I expect most enc_dec calls for ESTIMATION to compile to just a few 
> multiplies and adds for anything but a container you have to walk.
> 
> 
> > 
> >   unsafe_encoder reserve(size_t s);
> > 
> > so that we can do
> > 
> >   void encode(bufferlist::safe_appender& ap) const {
> >     ENCODE_START(2, 2);
> >     // do a single range check for all of our simple members
> >     {
> >       unsafe_encoder t = ap.reserve(5 * sizeof(uint64_t));
> >       ::encode(foo, t);
> >       ::encode(bar, t);
> >       ::encode(baz, t);
> >       ::encode(a, t);
> >       ::encode(b, t);
> >     }
> >     // use the safe encoder for some complex ones
> >     ::encode(widget, ap);
> >     ::encode(widget2, ap);
> >     // explicitly estimate a simple container
> >     {
> >       unsafe_encoder t = ap.reserve(sub.size() * known_worst_case);
> >       ::encode(sub, t);
> >     }
> >     // dynamically range check a complex container
> >     ::encode(complex_container, ap);
> >   }
> > 
> > The enc_dec currently forces a full estimate in all cases, even when it's not
> > really needed.  Perhaps we can come up with some set of templates and
> > wrapper functions so that we can use safe and unsafe encoders somewhat
> > interchangeably so that the estimate infrastructure is only triggered when
> > needed?
> 
> I think the characterization of "full estimate in all cases, even when it's not really needed" isn't correct. If you code the above using my framework then for all except the two "complex" containers, the estimation is pretty simple, just a few adds and multiplies.
> 
> It is true for the "complex" containers that you'll walk the container twice, once for the estimate and once for the encode. However, that's the only difference, the actual cost of encoding the elements of the two containers is the same (i.e., a reserve of the size followed by the encoding of the container itself) the only delta is that overhead of the for_each loop itself -- the cost of walking the container. I believe that if the container is "complex", then the cost of walking it, when compared to the cost of checking the buffer size and doing the per-element encode is minimal. In other words, I disagree about the cost of the estimate phase (you're doing the same work just re-distributed, I only walk the container itself once extra).
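
To put the cost argument in concrete terms, here is a small sketch of the two kinds of estimate. It is an illustration under assumptions, not code from the posted framework: estimate_fixed and estimate_walk are made-up names, and the size_t length prefixes simply mirror the example program quoted further down.

```cpp
#include <assert.h>
#include <stddef.h>
#include <set>
#include <string>

// Per-element worst-case sizes for this sketch.
inline size_t estimate(size_t p, const int &) { return p + sizeof(int); }
inline size_t estimate(size_t p, const std::string &s) { return p + sizeof(size_t) + s.size(); }

// Fixed-size elements: no walk, just one multiply and two adds.
inline size_t estimate_fixed(size_t p, const std::set<int> &s) {
  return p + sizeof(size_t) + s.size() * sizeof(int);
}

// Variable-size elements: walk the container once to sum the estimates.
template <typename T>
inline size_t estimate_walk(size_t p, const std::set<T> &s) {
  p += sizeof(size_t);  // element count
  for (const T &e : s) p = estimate(p, e);
  return p;
}
```

For a set<int> both functions return the same number; only the walking version can size a set<string>, and its extra cost is one additional pass over the container.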
> 
> 
>  
> > 
> > sage
> > 
> > 
> > 
> > > What this framework doesn't yet handle very well is situations where
> > > you have a container with a contained type that is a primitive (i.e.,
> > > uint8) and you want that contained type to be custom encoded.
> > > Currently, the only solution is replace the contained primitive type
> > > with a class wrapper. Unfortunately a typedef is NOT sufficient to
> > differentiate it.
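
The typedef limitation comes straight from C++ overload resolution: a typedef names the same type, so a custom enc_dec overload for it would collide with the generic one. A minimal sketch, assuming a made-up packed_flags wrapper and an arbitrary two-byte encoding chosen only to make the difference visible:

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <type_traits>

// A typedef is only an alias: an enc_dec overload taking flags_alias& would be
// a redefinition of the uint8_t overload below, not a separate encoding.
typedef uint8_t flags_alias;
static_assert(std::is_same<flags_alias, uint8_t>::value, "typedef does not create a new type");

// A thin wrapper is a distinct type, so it can carry its own encoding.
struct packed_flags {
  uint8_t v;
};

// Generic one-byte estimate/encode for plain uint8_t.
inline size_t enc_dec(size_t p, uint8_t &) { return p + 1; }
inline char *enc_dec(char *p, uint8_t &o) { *p = static_cast<char>(o); return p + 1; }

// Custom encoding selected by the wrapper type (purely illustrative):
// two bytes, the value followed by its bitwise complement.
inline size_t enc_dec(size_t p, packed_flags &) { return p + 2; }
inline char *enc_dec(char *p, packed_flags &o) {
  p[0] = static_cast<char>(o.v);
  p[1] = static_cast<char>(~o.v);
  return p + 2;
}
```

A container of packed_flags would then pick up the custom overload element by element, at the cost of touching the declared element type.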
> > >
> > > Allen Samuels
> > > SanDisk |a Western Digital brand
> > > 2880 Junction Avenue, San Jose, CA 95134
> > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > >
> > >
> > > > -----Original Message-----
> > > > From: Mark Nelson [mailto:mnelson@redhat.com]
> > > > Sent: Thursday, July 14, 2016 4:16 AM
> > > > To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> > > > <sweil@redhat.com>
> > > > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > Subject: Re: bluestore onode diet and encoding overhead
> > > >
> > > > On 07/14/2016 12:52 AM, Allen Samuels wrote:
> > > > > As promised, here's some code that hacks out a new encode/decode
> > > > framework. That has the advantage of only having to list the fields
> > > > of a struct once and is pretty much guaranteed to never overrun a
> > buffer....
> > > > >
> > > > > Comments are requested :)
> > > >
> > > > It compiles! :D
> > > >
> > > > I looked over the code, but I want to look it over again after I've
> > > > had my coffee since I'm still shaking the cobwebs out.  Would the
> > > > idea here be that if you are doing varint encoding for example that
> > > > you always allocate the buffer based on ESTIMATE (also taking into
> > > > account the encoding overhead), but typically expect a much smaller
> > encoding?
> > > >
> > > > As it is, it's very clever.
> > > >
> > > > Mark
> > > >
> > > > >
> > > > >
> > > > > #include <iostream>
> > > > > #include <fstream>
> > > > > #include <set>
> > > > > #include <string>
> > > > > #include <string.h>
> > > > >
> > > > >
> > > > > /*******************************************************
> > > > >
> > > > >
> > > > >    New fast encode/decode framework.
> > > > >
> > > > >    The entire framework is built around the idea that each object has three operations:
> > > > >
> > > > >      ESTIMATE  -- worst-case estimate of the amount of storage required for this object
> > > > >      ENCODE    -- encode object into buffer of size ESTIMATE
> > > > >      DECODE    -- decode object from buffer of size actual.
> > > > >
> > > > >    Each object has a single templated function that actually provides all three operations in a single set of code.
> > > > >    By doing this, it's pretty much guaranteed that the ESTIMATE and the ENCODE code are in harmony (i.e. that the estimate is correct)
> > > > >    it also saves a lot of typing/reading...
> > > > >
> > > > >    Generally, all three operations are provided on a single function name with the input and return parameters overloaded to distinguish them.
> > > > >
> > > > >    It's observed that for each of the three operations there is a single value which needs to be transmitted between each of the micro-encode/decode calls
> > > > >    Yes, this is confusing, but let's look at a simple example
> > > > >
> > > > >     struct simple {
> > > > >       int a;
> > > > >       float b;
> > > > >       string c;
> > > > >       set<int> d;
> > > > >     };
> > > > >
> > > > >     To encode this struct we generate a function that does the micro-encoding of each of the fields of the struct
> > > > >     Here's an example of a function that does the ESTIMATE operation.
> > > > >
> > > > >     size_t simple::estimate() {
> > > > >        return
> > > > >           sizeof(a) +
> > > > >           sizeof(b) +
> > > > >           c.size() +
> > > > >           d.size() * sizeof(int);
> > > > >     }
> > > > >
> > > > >     We're going to re-write it as:
> > > > >
> > > > >     size_t simple::estimate(size_t p) {
> > > > >        p = estimate(p,a);
> > > > >        p = estimate(p,b);
> > > > >        p = estimate(p,c);
> > > > >        p = estimate(p,d);
> > > > >        return p;
> > > > >     }
> > > > >
> > > > >     assuming that the sorta function:
> > > > >
> > > > >     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
> > > > >     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
> > > > >
> > > > >
> > > > >     similarly, the encode operation is represented as:
> > > > >
> > > > >     char * simple::encode(char *p) {
> > > > >        p = encode(p,a);
> > > > >        p = encode(p,b);
> > > > >        p = encode(p,c);
> > > > >        p = encode(p,d);
> > > > >        return p;
> > > > >     }
> > > > >
> > > > >     similarly, the decode operation is represented as:
> > > > >
> > > > >     const char * simple::decode(const char *p) {
> > > > >        p = decode(p,a);
> > > > >        p = decode(p,b);
> > > > >        p = decode(p,c);
> > > > >        p = decode(p,d);
> > > > >        return p;
> > > > >     }
> > > > >
> > > > >
> > > > > You can now see that it's possible to create a single function
> > > > > that does all three operations in a single block of code, provided
> > > > > that you can
> > > > fiddle the input/output parameter types appropriately.
> > > > >
> > > > > In essence the pattern is
> > > > >
> > > > >     p = enc_dec(p,struct_field_1);
> > > > >     p = enc_dec(p,struct_field_2);
> > > > >     p = enc_dec(p,struct_field_3);
> > > > >
> > > > > With the type of p being set differently for each operation, i.e.,
> > > > >     for ESTIMATE, p = size_t
> > > > >     for ENCODE,   p = char *
> > > > >     for DECODE,   p = const char *
> > > > >
> > > > > This is the essence of how the encode/decode framework operates.
> > > > Though there is some more sophistication...
> > > > >
> > > > > ----------------------
> > > > >
> > > > > We also want to allow the encode/decode machinery to be per-type
> > > > > and to operate
> > > > >
> > > > >
> > > > > *****************************************************************************/
> > > > >
> > > > > using namespace std;
> > > > >
> > > > > //
> > > > > // Just like the existing encode/decode machinery. The environment provides a rich set of
> > > > > // pre-defined encodes for primitive types and containers
> > > > > //
> > > > >
> > > > > #define DEFINE_ENC_DEC_RAW(type) \
> > > > > inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> > > > > inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
> > > > > inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }
> > > > >
> > > > > DEFINE_ENC_DEC_RAW(int);
> > > > > DEFINE_ENC_DEC_RAW(size_t);
> > > > >
> > > > > //
> > > > > // String encode/decode (Yea, I know size_t isn't portable -- this is an EXAMPLE man...)
> > > > > //
> > > > > inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
> > > > > inline char * enc_dec(char * p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
> > > > > inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); }
> > > > >
> > > > > //
> > > > > // Let's do a container.
> > > > > //
> > > > > // One of the problems with a container is that making an accurate estimate of the size
> > > > > // would theoretically require that you walk the entire container and add up the sizes of each element.
> > > > > // We probably don't want to do that. So here, I do a hack that just assumes that I can fake up an individual element
> > > > > // and multiply that by the number of elements in a container. This hack works anytime that the estimate function
> > > > > // for the contained type has a fixed maximum size. BTW, this is safe, if the contained type has a variable size
> > > > > // (like set<string>) then it will fault out the first time you run it.
> > > > > //
> > > > > // Naturally, something like set<string> or map<string,string> is a highly desirable thing to be able to encode/decode
> > > > > // there's no reason that you can't create an enc_dec_slow function that properly computes the maximum size by walking the container.
> > > > > //
> > > > > template<typename t>
> > > > > inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
> > > > >
> > > > > template<typename t>
> > > > > inline char *enc_dec(char *p,set<t>& s) {
> > > > >    size_t sz = s.size();
> > > > >    p = enc_dec(p,sz);
> > > > >    for (const t& e : s) {
> > > > >       p = enc_dec(p,const_cast<t&>(e));
> > > > >    }
> > > > >    return p;
> > > > > }
> > > > >
> > > > > template<typename t>
> > > > > inline const char *enc_dec(const char *p,set<t>&s) {
> > > > >    size_t sz;
> > > > >    p = enc_dec(p,sz);
> > > > >    while (sz--) {
> > > > >       t temp;
> > > > >       p = enc_dec(p,temp);
> > > > >       s.insert(temp);
> > > > >    }
> > > > >    return p;
> > > > > }
> > > > >
> > > > > //
> > > > > // Specialized encode/decode for a single data type. These are invoked explicitly...
> > > > > //
> > > > > inline size_t enc_dec_lba(size_t p,int& lba) {
> > > > >    return p + sizeof(lba); // Max....
> > > > > }
> > > > >
> > > > > inline char * enc_dec_lba(char *p,int& lba) {
> > > > >    *p = 15;
> > > > >    return p + 1; // blah blah
> > > > > }
> > > > >
> > > > > inline const char *enc_dec_lba(const char *p,int& lba) {
> > > > >    lba = *p;
> > > > >    return p+1;
> > > > > }
> > > > >
> > > > > //
> > > > > // Specialized encode/decode for more sophisticated things primitives.
> > > > > //
> > > > > // Here's an example of a encode/decoder for a pair of fields
> > > > > //
> > > > > inline size_t enc_dec_range(size_t p,short& start,short& end) {
> > > > >    return p + 2 * sizeof(short);
> > > > > }
> > > > >
> > > > > inline char *enc_dec_range(char *p, short& start, short& end) {
> > > > >    short *s = (short *) p;
> > > > >    s[0] = start;
> > > > >    s[1] = end;
> > > > >    return p + sizeof(short) * 2;
> > > > > }
> > > > >
> > > > > inline const char *enc_dec_range(const char *p,short& start, short& end) {
> > > > >    start = *(short *)p;
> > > > >    end   = *(short *)(p + sizeof(short));
> > > > >    return p + 2*sizeof(short);
> > > > > }
> > > > >
> > > > >
> > > > > //
> > > > > // Some C++ template wizardry to make the single encode/decode function possible.
> > > > > //
> > > > > enum SERIAL_TYPE {
> > > > >    ESTIMATE,
> > > > >    ENCODE,
> > > > >    DECODE
> > > > > };
> > > > >
> > > > > template <enum SERIAL_TYPE s> struct serial_type;
> > > > >
> > > > > template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> > > > > template<> struct serial_type<ENCODE>   { typedef char * type; };
> > > > > template<> struct serial_type<DECODE>   { typedef const char *type; };
> > > > >
> > > > > //
> > > > > // This macro is the key, it connects the external non-member function to the correct member function.
> > > > > //
> > > > > #define DEFINE_STRUCT_ENC_DEC(s) \
> > > > > inline size_t      enc_dec(size_t p, s &o) { return o.enc_dec<ESTIMATE>(p); } \
> > > > > inline char *      enc_dec(char *p , s &o)  { return o.enc_dec<ENCODE>(p); } \
> > > > > inline const char *enc_dec(const char *p,s &o)  { return o.enc_dec<DECODE>(p); }
> > > > >
> > > > > //
> > > > > // Our example structure
> > > > > //
> > > > > struct astruct {
> > > > >    int a;
> > > > >    set<int> b;
> > > > >    int lba;
> > > > >    short start,end;
> > > > >
> > > > >    //
> > > > >    // <<<<< You need to provide this function just once.
> > > > >    //
> > > > >    template<enum SERIAL_TYPE s> typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
> > > > >       p = ::enc_dec(p,a);
> > > > >       p = ::enc_dec(p,b);
> > > > >       p = ::enc_dec_lba(p,lba);
> > > > >       p = ::enc_dec_range(p,start,end);
> > > > >       return p;
> > > > >    }
> > > > > };
> > > > >
> > > > > //
> > > > > // This macro connects the global enc_dec to the member function.
> > > > > // One of these per struct declaration
> > > > > //
> > > > > DEFINE_STRUCT_ENC_DEC(astruct);
> > > > >
> > > > >
> > > > > //
> > > > > // Here's a simple test program. The real encode/decode framework needs to be connected to bufferlist using the pseudo-code
> > > > > // that I documented in my previous email.
> > > > > //
> > > > >
> > > > > int main(int argc,char **argv) {
> > > > >
> > > > >    astruct a;
> > > > >    a.a = 10;
> > > > >    a.b.insert(2);
> > > > >    a.b.insert(3);
> > > > >    a.lba = 12;
> > > > >
> > > > >    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> > > > >    cout << "Estimated size is " << s << "\n";
> > > > >
> > > > >    char buffer[100];
> > > > >
> > > > >    char *end = a.enc_dec<ENCODE>(buffer);
> > > > >
> > > > >    cout << "Actual storage was " << end-buffer << "\n";
> > > > >
> > > > >    astruct b;
> > > > >
> > > > >    (void) b.enc_dec<DECODE>(buffer); // decode it
> > > > >
> > > > >    cout << "A.a = " << b.a << "\n";
> > > > >    for (auto e : b.b) {
> > > > >       cout << " " << e;
> > > > >    }
> > > > >
> > > > >    cout << "\n";
> > > > >
> > > > >    cout << "a.lba = " << b.lba << "\n";
> > > > >
> > > > >    return 0;
> > > > > }
> > > > >
> > > > >
> > > > > Allen Samuels
> > > > > SanDisk |a Western Digital brand
> > > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > >
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Mark Nelson [mailto:mnelson@redhat.com]
> > > > >> Sent: Tuesday, July 12, 2016 8:13 PM
> > > > >> To: Sage Weil <sweil@redhat.com>; Allen Samuels
> > > > >> <Allen.Samuels@sandisk.com>
> > > > >> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > >> Subject: Re: bluestore onode diet and encoding overhead
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 07/12/2016 08:50 PM, Sage Weil wrote:
> > > > >>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> > > > >>>> Good analysis.
> > > > >>>>
> > > > >>>> My original comments about putting the oNode on a diet included
> > > > >>>> the idea of a "custom" encode/decode path for certain high-usage
> > cases.
> > > > >>>> At the time, Sage resisted going down that path hoping that a
> > > > >>>> more optimized generic case would get the job done. Your
> > > > >>>> analysis shows that while we've achieved significant space
> > > > >>>> reduction this has come at the expense of CPU time -- which
> > > > >>>> dominates small object performance (I suspect that eventually
> > > > >>>> we'd discover that the variable length decode path would be
> > > > >>>> responsible for a substantial read performance degradation also
> > > > >>>> -- which may or may not be part of the read performance
> > > > >>>> drop-off that you're seeing). This isn't a
> > > > surprising
> > > > >> result, though it is unfortunate.
> > > > >>>>
> > > > >>>> I believe we need to revisit the idea of custom encode/decode
> > > > >>>> paths for high-usage cases, only now the gains need to be
> > > > >>>> focused on CPU utilization as well as space efficiency.
> > > > >>>
> > > > >>> I still think we can get most or all of the way there in a
> > > > >>> generic way by revising the way that we interact with bufferlist
> > > > >>> for encode and
> > > > decode.
> > > > >>> We haven't actually tried to optimize this yet, and the current
> > > > >>> code is pretty horribly inefficient (asserts all over the place,
> > > > >>> and many layers of pointer indirection to do a simple append).
> > > > >>> I think we need to do two
> > > > >>> things:
> > > > >>>
> > > > > >>> 1) decode path: optimize the iterator class so that it has a
> > > > > >>> const char *current and const char *current_end that point into
> > > > > >>> the current buffer::ptr.  This way any decode will have a single
> > > > > >>> pointer add+comparison to ensure there is enough data to copy
> > > > > >>> before falling into the slow path (partial buffer, move to next
> > > > > >>> buffer, etc.).
> > > > >>>
> > > > >>
> > > > >> I don't have a good sense yet for how much this is hurting us in
> > > > >> the read path.  We screwed something up in the last couple of
> > > > >> weeks and small
> > > > reads
> > > > >> are quite slow.
> > > > >>
> > > > > >>> 2) Having that comparison is still not ideal, but we should
> > > > >>> consider ways to get around that too.  For example, if we know
> > > > >>> that we are going to decode N M-byte things, we could do an
> > > > >>> iterator 'reserve' or 'check' that ensures we have a valid
> > > > >>> pointer for that much and then proceed without checks.  The
> > > > >>> interface here would be tricky, though, since in the slow case
> > > > >>> we'll span buffers and need to magically fall back to a
> > > > >>> different decode path (hard to maintain) or do a temporary copy
> > > > >>> (probably faster but we need to ensure the iterator owns it and
> > > > > >>> frees it later).  I'd say this is step 2 and optional; step 1
> > > > >>> will have the most
> > > > >> benefit.
> > > > >>>
> > > > >>> 3) encode path: currently all encode methods take a bufferlist&
> > > > >>> and the bufferlist itself as an append buffer.  I think this is
> > > > >>> flawed and limiting.  Instead, we should make a new class called
> > > > >>> buffer::list::appender (or similar) and templatize the encode
> > > > >>> methods so they can take a safe_appender (which does bounds
> > > > >>> checking) or an unsafe_appender (which does not).  For the
> > > > >>> latter, the user takes responsibility for making sure there is
> > > > >>> enough space by doing a
> > > > >>> reserve() type call which returns an unsafe_appender, and it's
> > > > >>> their job to make sure they don't shove too much data into it.
> > > > >>> That should make the encode path a memcpy + ptr increment (for
> > > > >>> savvy/optimized
> > > > >> callers).
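
For item 1 in the list above, a minimal sketch of what the decode fast path could look like. toy_iterator and its segment list are hypothetical stand-ins (a vector of std::string playing the role of the buffer::ptr chain), not the real bufferlist iterator. The check is written as a pointer subtraction rather than an addition so that it never forms a pointer past the end of the segment:

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <string>
#include <vector>

// Hypothetical iterator over a chain of segments; not the Ceph API.
class toy_iterator {
  const std::vector<std::string> &segs_;
  size_t seg_ = 0;
  const char *current_ = nullptr;      // next unread byte in this segment
  const char *current_end_ = nullptr;  // one past the last byte of this segment

  void load() {
    if (seg_ < segs_.size()) {
      current_ = segs_[seg_].data();
      current_end_ = current_ + segs_[seg_].size();
    } else {
      current_ = current_end_ = nullptr;
    }
  }

  // Slow path: the value spans segments, so copy byte by byte.
  void copy_slow(void *dst, size_t len) {
    char *d = static_cast<char *>(dst);
    while (len > 0) {
      if (current_ == current_end_) {
        ++seg_;
        assert(seg_ < segs_.size());  // ran out of data
        load();
        continue;
      }
      *d++ = *current_++;
      --len;
    }
  }

public:
  explicit toy_iterator(const std::vector<std::string> &segs) : segs_(segs) { load(); }

  void copy(void *dst, size_t len) {
    if (len == 0) return;
    // Fast path: one comparison, then memcpy plus a pointer increment.
    if (static_cast<size_t>(current_end_ - current_) >= len) {
      memcpy(dst, current_, len);
      current_ += len;
    } else {
      copy_slow(dst, len);
    }
  }
};
```

A decode that fits in the current segment never leaves the fast path; only a value that straddles two segments falls back to the byte-wise copy.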
> > > > >>
> > > > > >> Seems reasonable and similar in performance to what Piotr and I
> > > > > >> were discussing this morning.  As a very simple test I was
> > > > > >> thinking of doing a quick size computation and then passing that
> > > > > >> in to increase the append_buffer size when the bufferlist is
> > > > > >> created in Bluestore::_txc_write_nodes.  His idea went a bit
> > > > > >> farther to break the encapsulation, compute the fully encoded
> > > > > >> message, and dump it directly into a buffer of a computed size
> > > > > >> without the extra assert checks or bounds checking.  Obviously his
> > > > > >> idea would be faster but more work.
> > > > >>
> > > > >> It sounds like your solution would be similar but a bit more formalized.
> > > > >>
> > > > >>>
> > > > >>> I suggest we use bluestore as a test case to make the interfaces
> > > > >>> work and be fast.  If we succeed we can take advantage of it
> > > > > >>> across the rest of the code base as well.
> > > > >>
> > > > > >> Do we have other places in the code with similar byte append
> > > > > >> behavior?  That's what's really killing us I think, especially with
> > > > > >> how small the new append_buffer is when you run out of space when
> > > > > >> appending bytes.
> > > > >>
> > > > >>>
> > > > >>> That's my thinking, at least.  I haven't had time to prototype
> > > > >>> it out yet, but I think our goal should be to make the
> > > > >>> encode/decode paths capable of being a memcpy + ptr addition in
> > > > >>> the fast path, and let that guide the interface...
> > > > >>>
> > > > >>> sage
> > > > >>> --
> > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > >>> in the body of a message to majordomo@vger.kernel.org More
> > > > >> majordomo
> > > > >>> info at  http://vger.kernel.org/majordomo-info.html
> > > > >>>
> > >
> > >
> 
> 


* RE: bluestore onode diet and encoding overhead
  2016-08-13 21:36                 ` Sage Weil
@ 2016-08-14 20:37                   ` Allen Samuels
  0 siblings, 0 replies; 39+ messages in thread
From: Allen Samuels @ 2016-08-14 20:37 UTC
  To: Sage Weil; +Cc: Mark Nelson, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Saturday, August 13, 2016 2:36 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: RE: bluestore onode diet and encoding overhead
> 
> On Fri, 12 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, August 12, 2016 9:18 AM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-
> > > devel@vger.kernel.org>
> > > Subject: RE: bluestore onode diet and encoding overhead
> > >
> > > Okay, I finally had some time to read through this thread!
> > >
> > > On Thu, 14 Jul 2016, Allen Samuels wrote:
> > > > Yes, I did actually run the code before I posted it.
> > > >
> > > > w.r.t. varint encoding. You have two choices w.r.t. a variable
> > > > length encoded, you could examine the data to accurately predict
> > > > the output size OR you could just return a constant that
> > > > represents the worst-case
> > > > (max) size.  For individual fields, it probably doesn't matter
> > > > what you chose, but for fields that are part of something in a
> > > > container, you probably want the option of NOT running down the
> > > > container to size up each element -- so you'd just choose the
> > > > worst-case size for the estimator.
> > > >
> > > > Though this code doesn't show it, I wrote some pseudo-code in a
> > > > previous e-mail that glues this framework into the bufferlist stuff.
> > > > That pseudo code is well prepared for estimate functions that are
> > > > too large (indeed, it expects that to happen) and it naturally
> > > > handles buffer overrun detection.
> > > >
> > > > I didn't describe it in the example, but this framework very
> > > > naturally handles versioning; you just add some code like:
> > > >
> > > > struct abc {
> > > >    int version;
> > > >    int a;
> > > >    int b;
> > > >    ..... enc_dec(p) {
> > > >       ::enc_dec(p, version);
> > > >       ::enc_dec(p, a);
> > > >       // This field is present in all estimate and encode operations,
> > > >       // but only in decode operations when version is > 5
> > > >       if (s != DECODE || version > 5) ::enc_dec(p, b);
> > > >    }
> > > > };
> > >
> > > This is pretty cool.  I'm mainly nervous about the versioning stuff.
> > > The current encode/decode scheme already has a length (so we're good
> > > there), and also two other fields: struct_v and compat_v, indicating
> > > the version of the encoding and the oldest decoder that can
> > > understand it.  I don't see any reason why that couldn't be
> > > replicated here.  I am a bit nervous about the decoding side,
> > > though, since it can get complicated.  What you have above is the
> > > common case, but even a moderately simple one (where we didn't have
> > > to do anything kludgey) looks like this:
> > >
> > > 	https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc#L2337
> >
> > Yes, the struct_v and compat_v stuff can be easily handled and I don't
> > think the code is any more "complicated" than what's done for the object
> > that you showed. In fact the code is almost identical....
> >
> >
> > >
> > > I suspect those conditionals would end up looking like
> > >
> > >  if (s != DECODE || version > 5) {
> > >    ::enc_dec(p, b);
> > >  } else if (s == DECODE) {
> > >    b = default/compat value;
> > >  }
> >
> > Almost, I think it's simpler. Let's assume that the case we care about
> > is a field that's present only in versions > 5. Then the code is:
> >
> >    if (version > 5) {
> >       ::enc_dec(p,b);
> >    } else {
> >       b = 1;  // compat value
> >    }
> 
> If we set the version/struct_v var in the encode and estimate paths to the
> latest version, then yeah.  It may read a bit strange because the compat
> block will only be visited in the decode path (when struct_v could be
> small), but that should be fine.
> 
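A minimal sketch of how that versioned pattern could look once it is
compiled, for anyone following along. This is not the code on the branch:
the `LATEST` constant, the memcpy-based int helpers, and the compat value
of 1 are assumptions made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

enum SERIAL_TYPE { ESTIMATE, ENCODE, DECODE };

// Maps each operation to the type of the value threaded through the calls.
template <SERIAL_TYPE s> struct serial_type;
template <> struct serial_type<ESTIMATE> { typedef size_t type; };
template <> struct serial_type<ENCODE>   { typedef char *type; };
template <> struct serial_type<DECODE>   { typedef const char *type; };

// Raw int helpers (memcpy avoids unaligned loads and stores).
inline size_t enc_dec(size_t p, int &) { return p + sizeof(int); }
inline char *enc_dec(char *p, int &o) {
  std::memcpy(p, &o, sizeof(int));
  return p + sizeof(int);
}
inline const char *enc_dec(const char *p, int &o) {
  std::memcpy(&o, p, sizeof(int));
  return p + sizeof(int);
}

struct abc {
  enum { LATEST = 6 };   // hypothetical current encoding version
  int version = LATEST;
  int a = 0;
  int b = 1;

  template <SERIAL_TYPE s>
  typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
    // Estimate and encode always describe the latest layout; on decode
    // the version is overwritten with whatever is read from the buffer.
    if (s != DECODE) version = LATEST;
    p = ::enc_dec(p, version);
    p = ::enc_dec(p, a);
    if (version > 5) {
      p = ::enc_dec(p, b);
    } else {
      b = 1;  // compat value, only reachable when decoding old data
    }
    return p;
  }
};
```

Because the version is pinned to the latest value on the estimate and
encode paths, the else branch is dead code there and only runs when an
old encoding is decoded.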
> >
> > Which is pretty much what it looks like in the current code...
> >
> >
> > >
> > > I'm still trying to sort out in my head how this relates to the
> > > appender thing.  I think they're largely orthogonal, but the estimate
> > > function here could be used to drive the unsafe_appender stuff pretty
> > > seamlessly.  Using the unsafe_appender manually is going to be a lot
> > > more error-prone, but should get the same performance benefit, without
> > > unifying the encode/decode stuff.
> >
> > Yes, they're conceptually orthogonal. In all cases, we make a
> > CONSERVATIVE estimate of the space that's required. Acquire that space
> > and then encode without checking for overflow. At the end of the encode
> > you can forgive the unused space.
> >
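To make the estimate / reserve / encode / give-back cycle concrete, here
is a rough sketch. It uses a std::vector<char> as a stand-in for the
bufferlist and a LEB128-style varint as the variable-length field; both
are assumptions for illustration, as are the names `MAX_VARINT`,
`put_varint` and `append_all`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Worst case for a 7-bits-per-byte varint of a 64-bit value: 10 bytes.
const size_t MAX_VARINT = 10;

// Unchecked varint write. The caller guarantees that at least
// MAX_VARINT bytes are available at p.
inline char *put_varint(char *p, uint64_t v) {
  while (v >= 0x80) {
    *p++ = char((v & 0x7f) | 0x80);
    v >>= 7;
  }
  *p++ = char(v);
  return p;
}

// Appends all values to out: one conservative estimate, one reservation,
// no per-field bounds checks, then the unused tail is given back.
inline void append_all(std::vector<char> &out,
                       const std::vector<uint64_t> &vals) {
  const size_t old_size = out.size();
  const size_t est = vals.size() * MAX_VARINT;  // never too small
  out.resize(old_size + est);
  char *start = out.data() + old_size;
  char *p = start;
  for (uint64_t v : vals) p = put_varint(p, v);
  const size_t used = size_t(p - start);
  assert(used <= est);                          // the estimate held
  out.resize(old_size + used);                  // forgive unused space
}
```

The estimate here is a multiply, so nothing is walked twice, and the
only check left on the encode side is the single assert at the end.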
> > >
> > > I'm a bit worried that the estimate process will be too slow, though.
> > > On a complicated nested object, for example, it will have to traverse
> > > the full data structure once to estimate, and then again to encode.
> > > It might be simpler and faster to have the outer parts of the structure
> > > operate on a safe_encoder, and construct an unsafe_encoder only when we
> > > are explicitly prepared to do the estimate.  For example, we can have a
> > > safe_encoder method like
> >
> > These are exactly the right tradeoffs. The framework allows you to
> > specify for particular containers whether you need to walk the container
> > or not to make an estimate. The optimization is only important for
> > containers with large numbers of small objects -- in any other case
> > (small numbers, large objects....) the overhead isn't important to
> > optimize out.
> >
> > BTW, there's nothing in the framework that requires you to do the
> > estimate process once for the whole world; you can absolutely do it
> > piecemeal like you suggest.
> 
> I'm having trouble imagining what this code is going to look like.  How
> close is your branch to a point where I can start playing with it?

Sure. http://github.com/allensamuels/ceph.git

> 
> > IMO, the most important thing about the estimate/encode cycle is that
> > the estimate NEVER be wrong. I was very worried about a design pattern
> > that separated the estimation process from the actual encoding process
> > in a way that would allow this kind of error to creep in. What I'm
> > worried about is an endless series of production run-time failures due
> > to incorrect estimation for low probability data-dependent cases -- in
> > other words, I consider it a requirement of the design that the
> > estimation process be so tightly linked to the encoding process that
> > possibilities of rare overruns are essentially eliminated by
> > construction. So, I ended up with a design pattern that -- by default --
> > is guaranteed to "get a good enough" answer [i.e., conservative] at the
> > expense of time -- but you can selectively override the default in a way
> > that I think is pretty safe to get that time back when it matters.
> 
> Yeah--I agree here.
> 
> sage
> 
> > BTW, I expect most enc_dec calls for ESTIMATION to compile to just a few
> > multiplies and adds for anything but a container you have to walk.
> >
> >
> > >
> > >   unsafe_encoder reserve(size_t s);
> > >
> > > so that we can do
> > >
> > >   void encode(bufferlist::safe_appender& ap) const {
> > >     ENCODE_START(2, 2);
> > >     // do a single range check for all of our simple members
> > >     {
> > >       unsafe_encoder t = ap.reserve(5 * sizeof(uint64_t));
> > >       ::encode(foo, t);
> > >       ::encode(bar, t);
> > >       ::encode(baz, t);
> > >       ::encode(a, t);
> > >       ::encode(b, t);
> > >     }
> > >     // use the safe encoder for some complex ones
> > >     ::encode(widget, ap);
> > >     ::encode(widget2, ap);
> > >     // explicitly estimate a simple container
> > >     {
> > >       unsafe_encoder t = ap.reserve(sub.size() * known_worst_case);
> > >       ::encode(sub, t);
> > >     }
> > >     // dynamically range check a complex container
> > >     ::encode(complex_container, ap);
> > >   }
> > >
> > > The enc_dec currently forces a full estimate in all cases, even when
> > > it's not really needed.  Perhaps we can come up with some set of
> > > templates and wrapper functions so that we can use safe and unsafe
> > > encoders somewhat interchangeably so that the estimate infrastructure
> > > is only triggered when needed?
> >
> > I think the characterization of "full estimate in all cases, even when
> > it's not really needed" isn't correct. If you code the above using my
> > framework then for all except the two "complex" containers, the
> > estimation is pretty simple, just a few adds and multiplies.
> >
> > It is true for the "complex" containers that you'll walk the container
> > twice, once for the estimate and once for the encode. However, that's
> > the only difference; the actual cost of encoding the elements of the two
> > containers is the same (i.e., a reserve of the size followed by the
> > encoding of the container itself). The only delta is the overhead of the
> > for_each loop itself -- the cost of walking the container. I believe
> > that if the container is "complex", then the cost of walking it, when
> > compared to the cost of checking the buffer size and doing the
> > per-element encode, is minimal. In other words, I disagree about the
> > cost of the estimate phase (you're doing the same work just
> > redistributed; I only walk the container itself once extra).
> >
> >
> >
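A small sketch of the two estimator styles being contrasted: a multiply
for a container of fixed-size elements versus one extra walk for a
container of variable-size elements. The function names and the
length-prefixed layout are assumptions for illustration, not the code on
the branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <set>
#include <string>

// Fixed-size elements: the estimate is a multiply, no walk needed.
inline size_t estimate_fixed(const std::set<int> &s) {
  return sizeof(size_t) + s.size() * sizeof(int);
}

// Variable-size elements: walk the container once to size it up.
inline size_t estimate_walk(const std::set<std::string> &s) {
  size_t n = sizeof(size_t);                 // element count
  for (const std::string &e : s)
    n += sizeof(size_t) + e.size();          // length prefix + bytes
  return n;
}

// Unchecked encode into a buffer of at least estimate_walk(s) bytes.
inline char *encode_set(char *p, const std::set<std::string> &s) {
  const size_t count = s.size();
  std::memcpy(p, &count, sizeof(count));
  p += sizeof(count);
  for (const std::string &e : s) {
    const size_t len = e.size();
    std::memcpy(p, &len, sizeof(len));
    p += sizeof(len);
    std::memcpy(p, e.data(), len);
    p += len;
  }
  return p;
}
```

The second walk in estimate_walk only touches sizes; all the copying
still happens once, in encode_set.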
> > >
> > > sage
> > >
> > >
> > >
> > > > What this framework doesn't yet handle very well is situations where
> > > > you have a container with a contained type that is a primitive (i.e.,
> > > > uint8) and you want that contained type to be custom encoded.
> > > > Currently, the only solution is to replace the contained primitive
> > > > type with a class wrapper. Unfortunately a typedef is NOT sufficient
> > > > to differentiate it.
> > > >
> > > > Allen Samuels
> > > > SanDisk |a Western Digital brand
> > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Mark Nelson [mailto:mnelson@redhat.com]
> > > > > Sent: Thursday, July 14, 2016 4:16 AM
> > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> > > > > <sweil@redhat.com>
> > > > > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > Subject: Re: bluestore onode diet and encoding overhead
> > > > >
> > > > > On 07/14/2016 12:52 AM, Allen Samuels wrote:
> > > > > > As promised, here's some code that hacks out a new encode/decode
> > > > > > framework. That has the advantage of only having to list the
> > > > > > fields of a struct once and is pretty much guaranteed to never
> > > > > > overrun a buffer....
> > > > > >
> > > > > > Comments are requested :)
> > > > >
> > > > > It compiles! :D
> > > > >
> > > > > I looked over the code, but I want to look it over again after I've
> > > > > had my coffee since I'm still shaking the cobwebs out.  Would the
> > > > > idea here be that if you are doing varint encoding, for example,
> > > > > you always allocate the buffer based on ESTIMATE (also taking into
> > > > > account the encoding overhead), but typically expect a much smaller
> > > > > encoding?
> > > > >
> > > > > As it is, it's very clever.
> > > > >
> > > > > Mark
> > > > >
> > > > > >
> > > > > >
> > > > > > #include <iostream>
> > > > > > #include <fstream>
> > > > > > #include <set>
> > > > > > #include <string>
> > > > > > #include <string.h>
> > > > > >
> > > > > >
> > > > > > /*******************************************************
> > > > > >
> > > > > >    New fast encode/decode framework.
> > > > > >
> > > > > >    The entire framework is built around the idea that each object
> > > > > >    has three operations:
> > > > > >
> > > > > >      ESTIMATE  -- worst-case estimate of the amount of storage
> > > > > >                   required for this object
> > > > > >      ENCODE    -- encode object into buffer of size ESTIMATE
> > > > > >      DECODE    -- decode object from buffer of size actual.
> > > > > >
> > > > > >    Each object has a single templated function that actually
> > > > > >    provides all three operations in a single set of code.
> > > > > >    By doing this, it's pretty much guaranteed that the ESTIMATE
> > > > > >    and the ENCODE code are in harmony (i.e. that the estimate is
> > > > > >    correct); it also saves a lot of typing/reading...
> > > > > >
> > > > > >    Generally, all three operations are provided on a single
> > > > > >    function name with the input and return parameters overloaded
> > > > > >    to distinguish them.
> > > > > >
> > > > > >    It's observed that for each of the three operations there is a
> > > > > >    single value which needs to be transmitted between each of the
> > > > > >    micro-encode/decode calls.
> > > > > >    Yes, this is confusing, but let's look at a simple example
> > > > > >
> > > > > >     struct simple {
> > > > > >       int a;
> > > > > >       float b;
> > > > > >       string c;
> > > > > >       set<int> d;
> > > > > >     };
> > > > > >
> > > > > >     To encode this struct we generate a function that does the
> > > > > >     micro-encoding of each of the fields of the struct.
> > > > > >     Here's an example of a function that does the ESTIMATE
> > > > > >     operation.
> > > > > >
> > > > > >     size_t simple::estimate() {
> > > > > >        return
> > > > > >           sizeof(a) +
> > > > > >           sizeof(b) +
> > > > > >           c.size() +
> > > > > >           d.size() * sizeof(int);
> > > > > >     }
> > > > > >
> > > > > >     We're going to re-write it as:
> > > > > >
> > > > > >     size_t simple::estimate(size_t p) {
> > > > > >        p = estimate(p,a);
> > > > > >        p = estimate(p,b);
> > > > > >        p = estimate(p,c);
> > > > > >        p = estimate(p,d);
> > > > > >        return p;
> > > > > >     }
> > > > > >
> > > > > >     assuming functions sort of like:
> > > > > >
> > > > > >     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
> > > > > >     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
> > > > > >
> > > > > >
> > > > > >     similarly, the encode operation is represented as:
> > > > > >
> > > > > >     char * simple::encode(char *p) {
> > > > > >        p = encode(p,a);
> > > > > >        p = encode(p,b);
> > > > > >        p = encode(p,c);
> > > > > >        p = encode(p,d);
> > > > > >        return p;
> > > > > >     }
> > > > > >
> > > > > >     similarly, the decode operation is represented as:
> > > > > >
> > > > > >     const char * simple::decode(const char *p) {
> > > > > >        p = decode(p,a);
> > > > > >        p = decode(p,b);
> > > > > >        p = decode(p,c);
> > > > > >        p = decode(p,d);
> > > > > >        return p;
> > > > > >     }
> > > > > >
> > > > > >
> > > > > > You can now see that it's possible to create a single function
> > > > > > that does all three operations in a single block of code, provided
> > > > > > that you can fiddle the input/output parameter types appropriately.
> > > > > >
> > > > > > In essence the pattern is
> > > > > >
> > > > > >     p = enc_dec(p,struct_field_1);
> > > > > >     p = enc_dec(p,struct_field_2);
> > > > > >     p = enc_dec(p,struct_field_3);
> > > > > >
> > > > > > With the type of p being set differently for each operation, i.e.,
> > > > > >     for ESTIMATE, p = size_t
> > > > > >     for ENCODE,   p = char *
> > > > > >     for DECODE,   p = const char *
> > > > > >
> > > > > > This is the essence of how the encode/decode framework operates.
> > > > > > Though there is some more sophistication...
> > > > > >
> > > > > > ----------------------
> > > > > >
> > > > > > We also want to allow the encode/decode machinery to be per-type
> > > > > > and to operate
> > > > > >
> > > > > > *****************************************************************************/
> > > > > >
> > > > > > using namespace std;
> > > > > >
> > > > > > //
> > > > > > // Just like the existing encode/decode machinery. The environment
> > > > > > // provides a rich set of pre-defined encodes for primitive types
> > > > > > // and containers
> > > > > > //
> > > > > >
> > > > > > #define DEFINE_ENC_DEC_RAW(type) \
> > > > > > inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> > > > > > inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
> > > > > > inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }
> > > > > >
> > > > > > DEFINE_ENC_DEC_RAW(int);
> > > > > > DEFINE_ENC_DEC_RAW(size_t);
> > > > > >
> > > > > > //
> > > > > > // String encode/decode (Yea, I know size_t isn't portable -- this
> > > > > > // is an EXAMPLE man...)
> > > > > > //
> > > > > > inline size_t enc_dec(size_t p,string& s) {
> > > > > >    return p + sizeof(size_t) + s.size();
> > > > > > }
> > > > > > inline char * enc_dec(char * p,string& s) {
> > > > > >    *(size_t *)p = s.size();
> > > > > >    memcpy(p+sizeof(size_t),s.c_str(),s.size());
> > > > > >    return p + sizeof(size_t) + s.size();
> > > > > > }
> > > > > > inline const char *enc_dec(const char *p,string& s) {
> > > > > >    s = string(p + sizeof(size_t),*(size_t *)p);
> > > > > >    return p + sizeof(size_t) + s.size();
> > > > > > }
> > > > > >
> > > > > > //
> > > > > > // Let's do a container.
> > > > > > //
> > > > > > // One of the problems with a container is that making an accurate
> > > > > > // estimate of the size would theoretically require that you walk
> > > > > > // the entire container and add up the sizes of each element.
> > > > > > // We probably don't want to do that. So here, I do a hack that
> > > > > > // just assumes that I can fake up an individual element and
> > > > > > // multiply that by the number of elements in a container. This
> > > > > > // hack works anytime that the estimate function for the contained
> > > > > > // type has a fixed maximum size. BTW, this is safe: if the
> > > > > > // contained type has a variable size (like set<string>) then it
> > > > > > // will fault out the first time you run it.
> > > > > > //
> > > > > > // Naturally, something like set<string> or map<string,string> is
> > > > > > // a highly desirable thing to be able to encode/decode; there's
> > > > > > // no reason that you can't create an enc_dec_slow function that
> > > > > > // properly computes the maximum size by walking the container.
> > > > > > //
> > > > > > template<typename t>
> > > > > > inline size_t enc_dec(size_t p,set<t>& s) {
> > > > > >    return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0));
> > > > > > }
> > > > > >
> > > > > > template<typename t>
> > > > > > inline char *enc_dec(char *p,set<t>& s) {
> > > > > >    size_t sz = s.size();
> > > > > >    p = enc_dec(p,sz);
> > > > > >    for (const t& e : s) {
> > > > > >       p = enc_dec(p,const_cast<t&>(e));
> > > > > >    }
> > > > > >    return p;
> > > > > > }
> > > > > >
> > > > > > template<typename t>
> > > > > > inline const char *enc_dec(const char *p,set<t>&s) {
> > > > > >    size_t sz;
> > > > > >    p = enc_dec(p,sz);
> > > > > >    while (sz--) {
> > > > > >       t temp;
> > > > > >       p = enc_dec(p,temp);
> > > > > >       s.insert(temp);
> > > > > >    }
> > > > > >    return p;
> > > > > > }
> > > > > >
> > > > > > //
> > > > > > // Specialized encode/decode for a single data type. These are
> > > > > > // invoked explicitly...
> > > > > > //
> > > > > > inline size_t enc_dec_lba(size_t p,int& lba) {
> > > > > >    return p + sizeof(lba); // Max....
> > > > > > }
> > > > > >
> > > > > > inline char * enc_dec_lba(char *p,int& lba) {
> > > > > >    *p = 15;
> > > > > >    return p + 1; // blah blah
> > > > > > }
> > > > > >
> > > > > > inline const char *enc_dec_lba(const char *p,int& lba) {
> > > > > >    lba = *p;
> > > > > >    return p+1;
> > > > > > }
> > > > > >
> > > > > > //
> > > > > > // Specialized encode/decode for more sophisticated primitives.
> > > > > > //
> > > > > > // Here's an example of an encoder/decoder for a pair of fields
> > > > > > //
> > > > > > inline size_t enc_dec_range(size_t p,short& start,short& end) {
> > > > > >    return p + 2 * sizeof(short);
> > > > > > }
> > > > > >
> > > > > > inline char *enc_dec_range(char *p, short& start, short& end) {
> > > > > >    short *s = (short *) p;
> > > > > >    s[0] = start;
> > > > > >    s[1] = end;
> > > > > >    return p + sizeof(short) * 2;
> > > > > > }
> > > > > >
> > > > > > inline const char *enc_dec_range(const char *p,short& start, short& end) {
> > > > > >    start = *(short *)p;
> > > > > >    end   = *(short *)(p + sizeof(short));
> > > > > >    return p + 2*sizeof(short);
> > > > > > }
> > > > > >
> > > > > >
> > > > > > //
> > > > > > // Some C++ template wizardry to make the single encode/decode
> > > > > > // function possible.
> > > > > > //
> > > > > > enum SERIAL_TYPE {
> > > > > >    ESTIMATE,
> > > > > >    ENCODE,
> > > > > >    DECODE
> > > > > > };
> > > > > >
> > > > > > template <enum SERIAL_TYPE s> struct serial_type;
> > > > > >
> > > > > > template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> > > > > > template<> struct serial_type<ENCODE>   { typedef char * type; };
> > > > > > template<> struct serial_type<DECODE>   { typedef const char *type; };
> > > > > >
> > > > > > //
> > > > > > // This macro is the key, it connects the external non-member
> > > > > > // function to the correct member function.
> > > > > > //
> > > > > > #define DEFINE_STRUCT_ENC_DEC(s) \
> > > > > > inline size_t      enc_dec(size_t p, s &o)      { return o.enc_dec<ESTIMATE>(p); } \
> > > > > > inline char *      enc_dec(char *p , s &o)      { return o.enc_dec<ENCODE>(p); } \
> > > > > > inline const char *enc_dec(const char *p,s &o)  { return o.enc_dec<DECODE>(p); }
> > > > > >
> > > > > > //
> > > > > > // Our example structure
> > > > > > //
> > > > > > struct astruct {
> > > > > >    int a;
> > > > > >    set<int> b;
> > > > > >    int lba;
> > > > > >    short start,end;
> > > > > >
> > > > > >    //
> > > > > >    // <<<<< You need to provide this function just once.
> > > > > >    //
> > > > > >    template<enum SERIAL_TYPE s> typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
> > > > > >       p = ::enc_dec(p,a);
> > > > > >       p = ::enc_dec(p,b);
> > > > > >       p = ::enc_dec_lba(p,lba);
> > > > > >       p = ::enc_dec_range(p,start,end);
> > > > > >       return p;
> > > > > >    }
> > > > > > };
> > > > > >
> > > > > > //
> > > > > > // This macro connects the global enc_dec to the member function.
> > > > > > // One of these per struct declaration
> > > > > > //
> > > > > > DEFINE_STRUCT_ENC_DEC(astruct);
> > > > > >
> > > > > >
> > > > > > //
> > > > > > // Here's a simple test program. The real encode/decode framework
> > > > > > // needs to be connected to bufferlist using the pseudo-code that
> > > > > > // I documented in my previous email.
> > > > > > //
> > > > > >
> > > > > > int main(int argc,char **argv) {
> > > > > >
> > > > > >    astruct a;
> > > > > >    a.a = 10;
> > > > > >    a.b.insert(2);
> > > > > >    a.b.insert(3);
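> > > > > >    a.lba = 12;
> > > > > >    a.start = 0; // initialize the range fields so the encoder
> > > > > >    a.end = 0;   // doesn't read indeterminate values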
> > > > > >
> > > > > >    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> > > > > >    cout << "Estimated size is " << s << "\n";
> > > > > >
> > > > > >    char buffer[100];
> > > > > >
> > > > > >    char *end = a.enc_dec<ENCODE>(buffer);
> > > > > >
> > > > > >    cout << "Actual storage was " << end-buffer << "\n";
> > > > > >
> > > > > >    astruct b;
> > > > > >
> > > > > >    (void) b.enc_dec<DECODE>(buffer); // decode it
> > > > > >
> > > > > >    cout << "A.a = " << b.a << "\n";
> > > > > >    for (auto e : b.b) {
> > > > > >       cout << " " << e;
> > > > > >    }
> > > > > >
> > > > > >    cout << "\n";
> > > > > >
> > > > > >    cout << "a.lba = " << b.lba << "\n";
> > > > > >
> > > > > >    return 0;
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Allen Samuels
> > > > > > SanDisk |a Western Digital brand
> > > > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > > >
> > > > > >
> > > > > >> -----Original Message-----
> > > > > >> From: Mark Nelson [mailto:mnelson@redhat.com]
> > > > > >> Sent: Tuesday, July 12, 2016 8:13 PM
> > > > > >> To: Sage Weil <sweil@redhat.com>; Allen Samuels
> > > > > >> <Allen.Samuels@sandisk.com>
> > > > > >> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > >> Subject: Re: bluestore onode diet and encoding overhead
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 07/12/2016 08:50 PM, Sage Weil wrote:
> > > > > >>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> > > > > >>>> Good analysis.
> > > > > >>>>
> > > > > >>>> My original comments about putting the oNode on a diet included
> > > > > >>>> the idea of a "custom" encode/decode path for certain
> > > > > >>>> high-usage cases.
> > > > > >>>> At the time, Sage resisted going down that path hoping that a
> > > > > >>>> more optimized generic case would get the job done. Your
> > > > > >>>> analysis shows that while we've achieved significant space
> > > > > >>>> reduction this has come at the expense of CPU time -- which
> > > > > >>>> dominates small object performance (I suspect that eventually
> > > > > >>>> we'd discover that the variable length decode path would be
> > > > > >>>> responsible for a substantial read performance degradation also
> > > > > >>>> -- which may or may not be part of the read performance
> > > > > >>>> drop-off that you're seeing). This isn't a surprising result,
> > > > > >>>> though it is unfortunate.
> > > > > >>>>
> > > > > >>>> I believe we need to revisit the idea of custom encode/decode
> > > > > >>>> paths for high-usage cases, only now the gains need to be
> > > > > >>>> focused on CPU utilization as well as space efficiency.
> > > > > >>>
> > > > > >>> I still think we can get most or all of the way there in a
> > > > > >>> generic way by revising the way that we interact with bufferlist
> > > > > >>> for encode and decode.
> > > > > >>> We haven't actually tried to optimize this yet, and the current
> > > > > >>> code is pretty horribly inefficient (asserts all over the place,
> > > > > >>> and many layers of pointer indirection to do a simple append).
> > > > > >>> I think we need to do two things:
> > > > > >>>
> > > > > >>> 1) decode path: optimize the iterator class so that it has a
> > > > > >>> const char *current and const char *current_end that point into
> > > > > >>> the current buffer::ptr.  This way any decode will have a single
> > > > > >>> pointer add+comparison to ensure there is enough data to copy
> > > > > >>> before falling into the slow path (partial buffer, move to next
> > > > > >>> buffer, etc.).
> > > > > >>>
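For reference, a rough sketch of what that decode fast path could look
like. The `decode_cursor` struct and its `get` method are hypothetical
names for illustration, not the actual bufferlist iterator; the slow
path (spanning buffers) is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Cursor over one contiguous buffer segment.
struct decode_cursor {
  const char *current;      // next byte to decode
  const char *current_end;  // one past the end of this segment

  // Fast path: one subtraction and compare, then a fixed-size copy.
  // Returns false without consuming anything when the value does not
  // fit in the remaining contiguous bytes, so the caller can fall back
  // to a slower path that handles partial buffers.
  template <typename T>
  bool get(T &out) {
    if (size_t(current_end - current) < sizeof(T))
      return false;
    std::memcpy(&out, current, sizeof(T));
    current += sizeof(T);
    return true;
  }
};
```

Leaving the cursor untouched on failure is a design choice here: it lets
the slow path restart from the same position.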
> > > > > >>
> > > > > >> I don't have a good sense yet for how much this is hurting us in
> > > > > >> the read path.  We screwed something up in the last couple of
> > > > > >> weeks and small reads are quite slow.
> > > > > >>
> > > > > >>> 2) Having that comparison is still not ideal, but we should
> > > > > >>> consider ways to get around that too.  For example, if we know
> > > > > >>> that we are going to decode N M-byte things, we could do an
> > > > > >>> iterator 'reserve' or 'check' that ensures we have a valid
> > > > > >>> pointer for that much and then proceed without checks.  The
> > > > > >>> interface here would be tricky, though, since in the slow case
> > > > > >>> we'll span buffers and need to magically fall back to a
> > > > > >>> different decode path (hard to maintain) or do a temporary copy
> > > > > >>> (probably faster but we need to ensure the iterator owns it and
> > > > > >>> frees it later).  I'd say this is step 2 and optional; step 1
> > > > > >>> will have the most benefit.
> > > > > >>>
> > > > > >>> 3) encode path: currently all encode methods take a bufferlist&
> > > > > >>> and the bufferlist itself as an append buffer.  I think this is
> > > > > >>> flawed and limiting.  Instead, we should make a new class called
> > > > > >>> buffer::list::appender (or similar) and templatize the encode
> > > > > >>> methods so they can take a safe_appender (which does bounds
> > > > > >>> checking) or an unsafe_appender (which does not).  For the
> > > > > >>> latter, the user takes responsibility for making sure there is
> > > > > >>> enough space by doing a
> > > > > >>> reserve() type call which returns an unsafe_appender, and it's
> > > > > >>> their job to make sure they don't shove too much data into it.
> > > > > >>> That should make the encode path a memcpy + ptr increment (for
> > > > > >>> savvy/optimized callers).
> > > > > >>
> > > > > >> Seems reasonable and similar in performance to what Piotr and I
> > > > > >> were discussing this morning.  As a very simple test I was
> > > > > >> thinking of doing a quick size computation and then passing that
> > > > > >> in to increase the append_buffer size when the bufferlist is
> > > > > >> created in Bluestore::_txc_write_nodes.  His idea went a bit
> > > > > >> farther to break the encapsulation, compute the fully encoded
> > > > > >> message, and dump it directly into a buffer of a computed size
> > > > > >> without the extra assert checks or bounds checking.  Obviously
> > > > > >> his idea would be faster but more work.
> > > > > >>
> > > > > >> It sounds like your solution would be similar but a bit more
> > > > > >> formalized.
> > > > > >>
> > > > > >>>
> > > > > >>> I suggest we use bluestore as a test case to make the interfaces
> > > > > >>> work and be fast.  If we succeed we can take advantage of it
> > > > > >>> across the rest of the code base as well.
> > > > > >>
> > > > > >> Do we have other places in the code with similar byte append
> > > > > >> behavior?  That's what's really killing us I think, especially
> > > > > >> with how small the new append_buffer is when you run out of space
> > > > > >> when appending bytes.
> > > > > >>
> > > > > >>>
> > > > > >>> That's my thinking, at least.  I haven't had time to prototype
> > > > > >>> it out yet, but I think our goal should be to make the
> > > > > >>> encode/decode paths capable of being a memcpy + ptr addition in
> > > > > >>> the fast path, and let that guide the interface...
> > > > > >>>
> > > > > >>> sage
> > > > > >>>
> > > >
> > > >
> >
> >


end of thread (newest message: 2016-08-14 20:37 UTC)

Thread overview: 39+ messages
2016-07-12  7:03 bluestore onode diet and encoding overhead Mark Nelson
2016-07-12  7:13 ` Somnath Roy
2016-07-12 12:34   ` Mark Nelson
2016-07-12 12:40     ` Igor Fedotov
2016-07-12 12:47       ` Varada Kari
2016-07-12 12:48       ` Mark Nelson
2016-07-12 12:57         ` Igor Fedotov
2016-07-12 13:02           ` Mark Nelson
2016-07-12 15:14             ` Somnath Roy
2016-07-12 15:31               ` Igor Fedotov
2016-07-12 15:36                 ` Somnath Roy
2016-07-12 15:46                   ` Mark Nelson
2016-07-12 20:48                     ` Mark Nelson
2016-07-12 15:37               ` Varada Kari
2016-07-12 16:56               ` Sage Weil
2016-07-12 16:57                 ` Sage Weil
2016-07-12 17:06                   ` Somnath Roy
2016-07-12 17:50                 ` Allen Samuels
2016-07-12 15:20 ` Allen Samuels
2016-07-12 15:37   ` Mark Nelson
2016-07-12 21:15     ` Allen Samuels
2016-07-12 22:04       ` Mark Nelson
2016-07-13  1:50   ` Sage Weil
2016-07-13  3:13     ` Mark Nelson
2016-07-13  6:33       ` Piotr Dałek
2016-07-13 16:05         ` Sage Weil
2016-07-13 21:29           ` Allen Samuels
2016-07-14  5:52       ` Allen Samuels
2016-07-14 11:15         ` Mark Nelson
2016-07-14 14:10           ` Allen Samuels
2016-08-12 16:18             ` Sage Weil
2016-08-12 22:25               ` Allen Samuels
2016-08-13 21:36                 ` Sage Weil
2016-08-14 20:37                   ` Allen Samuels
2016-07-14 14:14           ` Allen Samuels
2016-07-14 16:20           ` Allen Samuels
2016-07-14 16:31             ` Mark Nelson
2016-07-14 16:34               ` Allen Samuels
2016-07-13 14:47     ` Samuel Just
