* Re: Some thoughts regarding the new store
2015-05-27 9:46 ` Haomai Wang
@ 2015-05-27 11:56 ` Mark Nelson
2015-05-27 21:14 ` Sage Weil
1 sibling, 0 replies; 4+ messages in thread
From: Mark Nelson @ 2015-05-27 11:56 UTC
To: Haomai Wang, Li Wang; +Cc: Sage Weil, Samuel Just, ceph-devel
On 05/27/2015 04:46 AM, Haomai Wang wrote:
> On Wed, May 27, 2015 at 4:41 PM, Li Wang <liwang@ubuntukylin.com> wrote:
>> I have just noticed the new store development, and had a
>> look at the idea behind it
>> (http://www.spinics.net/lists/ceph-devel/msg22712.html). As I
>> understand it, we want to avoid the double-write penalty of the
>> WRITE_AHEAD_LOGGING journal mechanism. The straightforward approach
>> is to optimize CREATE, APPEND and FULL-OBJECT-OVERWRITE by writing
>> into new files directly, then updating the metadata in a transaction.
>> Other changes include moving the object metadata from filesystem
>> extended attributes into a key/value database, and mapping an object
>> onto possibly multiple files.
>>
>> If my understanding is correct, then some issues seem to follow:
>>
>> 1 Garbage collection is needed to reclaim orphaned files left behind
>> by crashes;
>
> Yes, but so far we haven't dived into this problem, because
> currently newstore only allows one file per object.
>
> Anyway, I guess GC isn't a big problem. The journal keys should help;
> is there something I missed here?
>
>>
>> 2 On spinning disks, it loses the advantages of the journal, which
>> turns random writes into sequential writes, commits them in groups,
>> and leverages another disk to hide the commit delay.
>>
>
> We need to clarify something here: for small random write workloads,
> newstore still needs a journal for durability and shorter latency.
>
> Although filejournal uses write-ahead logging to improve performance,
> the journal is far away from the data location on disk (a partition
> or a preallocated file). We always need to write the data to disk as
> well, and the seek distance is long, I think. For newstore, my best
> wish is that journal and data could live in one allocation group, in
> local filesystem terms (it may be difficult though), like an ideal
> fragment implementation. In other words, fragments should be
> something that aggregates small writes, but we haven't made that work
> as expected yet.
>
> Although newstore's random write performance is currently worse than
> filestore's, I think that's not related to the design. We still have
> lots of things we could apply to improve it.
FWIW, newstore was looking as good or better for RBD random writes in
the last set of tests I did:
http://nhm.ceph.com/newstore/8c8c5903_rbd_rados_tests.pdf
>
>> 3 OVERWRITE theoretically does not benefit from this design, and the
>> introduction of fragments increases the object metadata overhead. The
>> possible mapping onto multiple files may also slow down the object
>> read/write performance. OVERWRITE is the major scenario for RBD and,
>> consequently, for cloud environments.
>
> Yes, we need to handle this. Actually, for mapping one object onto
> multiple files, we don't have a design (@sage yes? or did I miss it?).
> We may be able to think of a solution that makes a tradeoff :-)
This is the biggest issue holding us back right now imho. If you look
at the linked graphs above, the only place we are really significantly
behind filestore is on semi-large partial object overwrites. I suspect
we'll have to create fragments down to some size (maybe 512k?). There
was some discussion about all of this a couple of weeks ago at the
weekly perf meeting.
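To make the fragment idea concrete, here is a rough Python sketch of how a partial overwrite would map onto fixed-size fragments. It is only an illustration: the 512 KiB size is the "maybe 512k?" guess above, and `touched_fragments` is a hypothetical helper, not anything in newstore.

```python
FRAGMENT_SIZE = 512 * 1024  # assumed fragment size, taken from the "maybe 512k?" guess


def touched_fragments(offset, length, fragment_size=FRAGMENT_SIZE):
    """Split a write into per-fragment pieces.

    Returns a list of (fragment_index, start_in_fragment, byte_count)
    tuples, one for each fragment the write touches.
    """
    if offset < 0 or length < 0 or fragment_size <= 0:
        raise ValueError("offset/length must be >= 0 and fragment_size > 0")
    pieces = []
    pos = offset
    end = offset + length
    while pos < end:
        index = pos // fragment_size          # which fragment holds this byte
        start = pos % fragment_size           # where the piece starts inside it
        count = min(fragment_size - start, end - pos)
        pieces.append((index, start, count))
        pos += count
    return pieces


if __name__ == "__main__":
    # 1 MiB overwrite starting 768 KiB into the object
    for piece in touched_fragments(768 * 1024, 1024 * 1024):
        print(piece)
```

In that example the write covers the second half of fragment 1, all of fragment 2, and the first half of fragment 3, so only the middle fragment could be replaced outright without reading old data.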
>
>
>>
>> 4 By mapping an object onto multiple files, we can potentially
>> optimize OVERWRITE by also turning it into APPEND using small
>> fragments, which actually mimics Btrfs. However, many small writes
>> may leave many small files in the backend local file system, which
>> may slow down the object read/write performance, especially on
>> spinning disks. More importantly, I think it goes, to some extent,
>> against the philosophy of object storage, which uses a big object to
>> store data to reduce the metadata cost, and leaves the block
>> management to the local file system. For a local file system, big
>> file performance is generally better than small file performance. If
>> we introduce fragments, it looks like the object store itself now
>> takes care of object data allocation.
>>
>> What is the community's opinion?
Large partial overwrites in newstore are pretty expensive. I suspect
we'll need both to improve how rocksdb handles its WAL and, I think,
to introduce at least semi-decently sized fragments. A simpler
alternative might be to reduce the default RBD block size and try to
optimize for that case. In the report I linked above there are rados
bench tests at different object sizes to try to get an idea of how rbd
performance at different block sizes might be bound.
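As a back-of-the-envelope illustration of that tradeoff, the sketch below counts data bytes written for one partial overwrite under three simplified strategies. Everything here is an assumption for illustration: `overwrite_cost` is a hypothetical helper, metadata and key/value traffic are ignored, and the "fragments" case assumes each touched fragment is rewritten in full as a new file.

```python
def overwrite_cost(offset, length, object_size, fragment_size):
    """Data bytes written for one partial overwrite, per strategy.

    journal:      write-ahead log, data written twice (log + in place)
    whole_object: rewrite the entire object as one new file
    fragments:    rewrite every touched fragment in full (assumed policy)
    """
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    if length <= 0 or offset < 0 or offset + length > object_size:
        raise ValueError("write must be non-empty and inside the object")
    first = offset // fragment_size
    last = (offset + length - 1) // fragment_size
    # The final fragment of an object may be shorter than fragment_size.
    fragments = sum(
        min(fragment_size, object_size - i * fragment_size)
        for i in range(first, last + 1)
    )
    return {
        "journal": 2 * length,
        "whole_object": object_size,
        "fragments": fragments,
    }


if __name__ == "__main__":
    MiB = 1024 * 1024
    # 1 MiB overwrite at 768 KiB in a 4 MiB object with 512 KiB fragments
    print(overwrite_cost(768 * 1024, 1 * MiB, 4 * MiB, 512 * 1024))
    # 4 KiB overwrite at offset 0 in the same layout
    print(overwrite_cost(0, 4096, 4 * MiB, 512 * 1024))
```

Under these assumptions the 1 MiB overwrite costs 2 MiB via the journal, 4 MiB as a whole-object rewrite, and 1.5 MiB with 512 KiB fragments, while the 4 KiB overwrite costs 8 KiB via the journal but 512 KiB with fragments. That is consistent with keeping a journal path for small writes.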
>
> Anyway, I think the core idea is that we make newstore better than
> filestore in most workloads.
I agree. I think it's already showing significant enough improvement in
enough cases that it's worth continuing to invest in.
>
>>
>> Cheers,
>> Li Wang
>
>
>
* Re: Some thoughts regarding the new store
2015-05-27 9:46 ` Haomai Wang
2015-05-27 11:56 ` Mark Nelson
@ 2015-05-27 21:14 ` Sage Weil
1 sibling, 0 replies; 4+ messages in thread
From: Sage Weil @ 2015-05-27 21:14 UTC
To: Haomai Wang; +Cc: Li Wang, Samuel Just, ceph-devel
On Wed, 27 May 2015, Haomai Wang wrote:
> On Wed, May 27, 2015 at 4:41 PM, Li Wang <liwang@ubuntukylin.com> wrote:
> > I have just noticed the new store development, and had a
> > look at the idea behind it
> > (http://www.spinics.net/lists/ceph-devel/msg22712.html). As I
> > understand it, we want to avoid the double-write penalty of the
> > WRITE_AHEAD_LOGGING journal mechanism. The straightforward approach
> > is to optimize CREATE, APPEND and FULL-OBJECT-OVERWRITE by writing
> > into new files directly, then updating the metadata in a transaction.
> > Other changes include moving the object metadata from filesystem
> > extended attributes into a key/value database, and mapping an object
> > onto possibly multiple files.
> >
> > If my understanding is correct, then some issues seem to follow:
> >
> > 1 Garbage collection is needed to reclaim orphaned files left behind
> > by crashes;
>
> Yes, but so far we haven't dived into this problem, because
> currently newstore only allows one file per object.
>
> Anyway, I guess GC isn't a big problem. The journal keys should help;
> is there something I missed here?
Currently we are sloppy. We could add an additional key on each commit to
make sure that we don't leak fragments on crash, but honestly I'm not sure
it's worth the effort given that it's such a tiny amount of space.
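For illustration only, one way to read the "additional key" idea is an intent key that is recorded before the data file is created and dropped in the same transaction that publishes the object metadata. The Python toy below uses plain dicts as stand-ins for the key/value database and the filesystem; `ToyStore`, the key names, and the `crash_before_commit` flag are all hypothetical and not newstore code.

```python
class ToyStore:
    """Toy model of write-new-file-then-commit with an intent key."""

    def __init__(self):
        self.kv = {}       # stand-in for the key/value database
        self.files = {}    # stand-in for the local filesystem: name -> bytes
        self.next_id = 0   # monotonically increasing file id

    def write_new_object(self, oid, data, crash_before_commit=False):
        fname = f"frag-{self.next_id}"
        self.next_id += 1
        # 1. Record the intent first, so the file is never unaccounted for.
        self.kv[f"intent/{fname}"] = oid
        # 2. Write the data into a brand new file.
        self.files[fname] = data
        if crash_before_commit:
            return fname  # simulated crash: file exists, metadata does not
        # 3. "Transaction": publish the metadata and drop the intent key.
        self.kv[f"object/{oid}"] = fname
        del self.kv[f"intent/{fname}"]
        return fname

    def recover(self):
        """On restart, remove files whose intent key was never cleared."""
        removed = []
        for key in [k for k in self.kv if k.startswith("intent/")]:
            fname = key[len("intent/"):]
            self.files.pop(fname, None)
            del self.kv[key]
            removed.append(fname)
        return removed


if __name__ == "__main__":
    store = ToyStore()
    store.write_new_object("a", b"committed")
    store.write_new_object("b", b"leaked", crash_before_commit=True)
    print(store.recover())   # names of the orphaned files that were reclaimed
    print(store.files)
```

The cost is one extra key write per new file, which is the effort being weighed above against the small amount of space a leak would waste.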
sage