Some thoughts regarding the new store


* Some thoughts regarding the new store
@ 2015-05-27  8:41 Li Wang
  2015-05-27  9:46 ` Haomai Wang
  0 siblings, 1 reply; 4+ messages in thread
From: Li Wang @ 2015-05-27  8:41 UTC (permalink / raw)
  To: Sage Weil, Samuel Just, ceph-devel

I have just noticed the new store development and had a look at the
idea behind it
(http://www.spinics.net/lists/ceph-devel/msg22712.html). My
understanding is that we want to avoid the double-write penalty of the
write-ahead logging journal mechanism. The straightforward approach is
to optimize CREATE, APPEND and FULL-OBJECT-OVERWRITE by writing into
new files directly and then updating the metadata in a transaction.
Other changes include moving the object metadata from filesystem
extended attributes into a key/value database, and mapping an object
onto possibly multiple files.
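
The write path summarized above can be sketched roughly as follows.
This is a toy model in Python, not newstore's code: the TinyStore
class, the dict standing in for the key/value database, and the random
file names are all assumptions made for illustration.

```python
import os
import tempfile
import uuid


class TinyStore:
    """Toy model: object data lives in files, object metadata in a dict
    that stands in for a key/value database (hypothetical layout)."""

    def __init__(self, root):
        self.root = root
        self.kv = {}  # object name -> name of the file holding its data

    def write_full(self, name, data):
        # 1. Write the payload once, into a brand-new file (no journal copy).
        fname = uuid.uuid4().hex
        with open(os.path.join(self.root, fname), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # 2. Commit: repoint the metadata at the new file. In a real
        #    key/value store this would be one atomic transaction.
        old = self.kv.get(name)
        self.kv[name] = fname
        # 3. The previous file, if any, is now unreferenced and can go.
        if old is not None:
            os.remove(os.path.join(self.root, old))

    def read(self, name):
        with open(os.path.join(self.root, self.kv[name]), "rb") as f:
            return f.read()


with tempfile.TemporaryDirectory() as root:
    store = TinyStore(root)
    store.write_full("obj", b"v1")
    store.write_full("obj", b"version2")  # full-object overwrite
    print(store.read("obj"))  # b'version2'
```

A crash between steps 1 and 2 leaves a file that no metadata refers
to, which is where the garbage collection concern below comes from.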

If my understanding is correct, it seems to raise some issues:

1 Garbage collection is needed to reclaim orphan files left behind
by crashes.

2 On spinning disks, it loses the advantages of the journal, which
turns random writes into sequential writes, commits them in groups,
and can use another disk to hide the commit delay.

3 OVERWRITE theoretically does not benefit from this design, and
introducing fragments increases the object metadata overhead. Mapping
an object onto multiple files may also slow down object read/write
performance. OVERWRITE is the major scenario for RBD and,
consequently, for cloud environments.

4 By mapping an object onto multiple files, we can potentially
optimize OVERWRITE by turning it into APPEND as well, using small
fragments, which effectively mimics Btrfs. However, many small writes
may leave many small files in the backing local file system, which may
slow down object read/write performance, especially on spinning
disks. More importantly, I think this goes, to some extent, against
the philosophy of object storage, which uses big objects to store data
in order to reduce the metadata cost and leaves block management to
the local file system. For a local file system, big-file performance
is generally better than small-file performance. If we introduce
fragments, it looks like the object store itself now takes care of
object data allocation.

What is the community's opinion?

Cheers,
Li Wang

* Re: Some thoughts regarding the new store
  2015-05-27  8:41 Some thoughts regarding the new store Li Wang
@ 2015-05-27  9:46 ` Haomai Wang
  2015-05-27 11:56   ` Mark Nelson
  2015-05-27 21:14   ` Sage Weil
  0 siblings, 2 replies; 4+ messages in thread
From: Haomai Wang @ 2015-05-27  9:46 UTC (permalink / raw)
  To: Li Wang; +Cc: Sage Weil, Samuel Just, ceph-devel

On Wed, May 27, 2015 at 4:41 PM, Li Wang <liwang@ubuntukylin.com> wrote:
> I have just noticed the new store development, and had a
> look at the idea behind it (http://www.spinics.net/lists/ceph-
> devel/msg22712.html), so my understanding, we wanna avoid the
> double-write penalty of WRITE_AHEAD_LOGGING journal mechanism,
> the straightforward thought is to optimize CREATE, APPEND and
> FULL-OBJECT-OVERWRITE by writing into new files directly,
> then update the metadata in a transaction. Other changes include:
> move the object metadata from filesystem extend attrbutes into
> key value database; map an object into possibly multiple files.
>
> If my understanding is correct, then it seems there follows some issues,
>
> 1 Garbage collection is needed to reclaim orphan files generated
> from crashing;

Yes, but so far we haven't dived into this problem, because newstore
currently only allows one file per object.

Anyway, I guess GC isn't a big problem. The journal keys should help.
Is there something I missed here?

>
> 2 On spinning disks, it loses the advantages that journal makes random
>  writes into sequential writes, then commits them in groups and
> leverages another disk to hide the committing delay.
>

We need to clarify something here: for small random write workloads,
newstore still needs a journal for durability and lower latency.

Although filejournal uses write-ahead logging to improve performance,
the journal is far away from the data location on disk (a partition or
a preallocated file). We always need to write the data to disk as
well, and I think the seek distance is long. For newstore, ideally the
journal and the data could live in one allocation group, in local
filesystem terms (it may be difficult though), just like an ideal
fragment implementation would. In other words, fragments should be
something that aggregates small writes, but we haven't gotten that
working as expected yet.

Although newstore's random write performance is currently worse than
filestore's, I don't think that is related to the design. We still
have lots of things we could apply to improve it.

> 3 OVERWRITE theoretically does not benefit from this design, and the
> introducing of fragment, increases the object metadata overhead. The
> possibly mapping of multiple files may also slow down the object
> read/write performance. OVERWRITE is the major scenario for RBD,
> consequently, for cloud environment.

Yes, we need to handle this. Actually, for mapping one object to
multiple files, we don't have a design yet (@sage, right? Or did I
miss something?). Maybe we can think of a solution that makes a
tradeoff :-)

>
> 4 By mapping an object into multiple files, potentially we can optimize
> OVERWRITE by turning it also into APPEND by using small fragments,
> that, actually mimic Btrfs. However, for many small writes, it may
> leave many small files in the backend local file system, that may slow
> down the object read/write performance, especially on spinning
> disk. More importantly, I think it, to some extent, against the
> philosophy of object storage, which uses a big object to store data to
> reduce the metadata cost, and leaves the block management for local
> file system. For a local file system, big file performance is generally
> better than small file. If we introduce fragment, it looks like the
> object storage self cares about the object data allocation now.
>
> What is the community's option?

Anyway, I think the core idea is that we make newstore better than
filestore for most workloads.

>
> Cheers,
> Li Wang

-- 
Best Regards,

Wheat

* Re: Some thoughts regarding the new store
  2015-05-27  9:46 ` Haomai Wang
@ 2015-05-27 11:56   ` Mark Nelson
  2015-05-27 21:14   ` Sage Weil
  1 sibling, 0 replies; 4+ messages in thread
From: Mark Nelson @ 2015-05-27 11:56 UTC (permalink / raw)
  To: Haomai Wang, Li Wang; +Cc: Sage Weil, Samuel Just, ceph-devel



On 05/27/2015 04:46 AM, Haomai Wang wrote:
> On Wed, May 27, 2015 at 4:41 PM, Li Wang <liwang@ubuntukylin.com> wrote:
>> I have just noticed the new store development, and had a
>> look at the idea behind it (http://www.spinics.net/lists/ceph-
>> devel/msg22712.html), so my understanding, we wanna avoid the
>> double-write penalty of WRITE_AHEAD_LOGGING journal mechanism,
>> the straightforward thought is to optimize CREATE, APPEND and
>> FULL-OBJECT-OVERWRITE by writing into new files directly,
>> then update the metadata in a transaction. Other changes include:
>> move the object metadata from filesystem extend attrbutes into
>> key value database; map an object into possibly multiple files.
>>
>> If my understanding is correct, then it seems there follows some issues,
>>
>> 1 Garbage collection is needed to reclaim orphan files generated
>> from crashing;
>
> Yes, but still now we haven't dive into this problem. Because
> currently newstore only allow one object one file.
>
> Anyway I guess GC isn't a big problem. journal keys should be help, is
> something I missed here?
>
>>
>> 2 On spinning disks, it loses the advantages that journal makes random
>>   writes into sequential writes, then commits them in groups and
>> leverages another disk to hide the committing delay.
>>
>
> We need to clarify something here, for small random write workload,
> newstore still need journal to make durable and shorter latency.
>
> Although filejournal make use of write ahead to improve performance,
> but journal is far away from data location in disk(partition or
> preallocation file). We always need to write data to disk and the seek
> distance is long I think. For newstore, actually in my best wish
> journal and data could be in one allocation group in local filesystem
> concept(it may be difficult though), just like a ideal fragment
> implementation as expected. In other word, fragment should be
> something to aggregate small writes, but we haven't make it done as
> expected.
>
> Although now newstore's random write performance is bad than
> filestore, I think it's not related to design. We still have lots of
> things could be apply to improve.

FWIW, newstore was looking as good or better for RBD random writes in 
the last set of tests I did:

http://nhm.ceph.com/newstore/8c8c5903_rbd_rados_tests.pdf

>
>> 3 OVERWRITE theoretically does not benefit from this design, and the
>> introducing of fragment, increases the object metadata overhead. The
>> possibly mapping of multiple files may also slow down the object
>> read/write performance. OVERWRITE is the major scenario for RBD,
>> consequently, for cloud environment.
>
> yes, we need to handle this thing. Actually for one object mapping to
> multi file, we doesn't have a design(@sage yes? or I missed?). We may
> could think of a solution to make tradeoff  :-)

This is the biggest issue holding us back right now imho.  If you look 
at the linked graphs above, the only place we are really significantly 
behind filestore is on semi-large partial object overwrites.  I suspect 
we'll have to create fragments down to some size (maybe 512k?).  There 
was some discussion about all of this a couple of weeks ago at the 
weekly perf meeting.
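
To illustrate what fixed-size fragments would mean for a partial
overwrite, here is a small Python sketch that splits a byte-range write
into per-fragment pieces. The function name and return shape are made
up for this example, and the 512 KiB default only mirrors the tentative
figure mentioned above; it is not a decided value.

```python
def touched_fragments(offset, length, frag_size=512 * 1024):
    """Split a byte-range write into per-fragment pieces.

    Returns a list of (fragment_index, offset_in_fragment, piece_length).
    Hypothetical helper, not part of newstore.
    """
    pieces = []
    end = offset + length
    while offset < end:
        idx = offset // frag_size
        in_frag = offset % frag_size
        # A piece stops at the fragment boundary or at the end of the write.
        n = min(frag_size - in_frag, end - offset)
        pieces.append((idx, in_frag, n))
        offset += n
    return pieces


# A 1 MiB overwrite starting at offset 768 KiB.
for piece in touched_fragments(768 * 1024, 1024 * 1024):
    print(piece)
```

In this example the write touches fragments 1, 2 and 3. Fragment 2 is
replaced in full, so it could presumably be written as a new file with
no journaling of the data, while only the partial pieces in fragments 1
and 3 would need the more expensive overwrite path.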

>
>
>>
>> 4 By mapping an object into multiple files, potentially we can optimize
>> OVERWRITE by turning it also into APPEND by using small fragments,
>> that, actually mimic Btrfs. However, for many small writes, it may
>> leave many small files in the backend local file system, that may slow
>> down the object read/write performance, especially on spinning
>> disk. More importantly, I think it, to some extent, against the
>> philosophy of object storage, which uses a big object to store data to
>> reduce the metadata cost, and leaves the block management for local
>> file system. For a local file system, big file performance is generally
>> better than small file. If we introduce fragment, it looks like the
>> object storage self cares about the object data allocation now.
>>
>> What is the community's option?

Large partial overwrites in newstore are pretty expensive. I suspect
we'll need both to improve how rocksdb handles its WAL and, I think,
to introduce at least semi-decently sized fragments. A simpler
alternative might be to reduce the default RBD block size and try to
optimize for that case. In the report I linked above there are rados
bench tests at different object sizes to try to get an idea of how rbd
performance at different block sizes might be bound.

>
> Anyway, I think the core idea is we make newstore better than
> filestore in most of workloads.

I agree.  I think it's already showing significant enough improvement
in enough cases that it's worth continuing to invest in.

>
>>
>> Cheers,
>> Li Wang
* Re: Some thoughts regarding the new store
  2015-05-27  9:46 ` Haomai Wang
  2015-05-27 11:56   ` Mark Nelson
@ 2015-05-27 21:14   ` Sage Weil
  1 sibling, 0 replies; 4+ messages in thread
From: Sage Weil @ 2015-05-27 21:14 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Li Wang, Samuel Just, ceph-devel

On Wed, 27 May 2015, Haomai Wang wrote:
> On Wed, May 27, 2015 at 4:41 PM, Li Wang <liwang@ubuntukylin.com> wrote:
> > I have just noticed the new store development, and had a
> > look at the idea behind it (http://www.spinics.net/lists/ceph-
> > devel/msg22712.html), so my understanding, we wanna avoid the
> > double-write penalty of WRITE_AHEAD_LOGGING journal mechanism,
> > the straightforward thought is to optimize CREATE, APPEND and
> > FULL-OBJECT-OVERWRITE by writing into new files directly,
> > then update the metadata in a transaction. Other changes include:
> > move the object metadata from filesystem extend attrbutes into
> > key value database; map an object into possibly multiple files.
> >
> > If my understanding is correct, then it seems there follows some issues,
> >
> > 1 Garbage collection is needed to reclaim orphan files generated
> > from crashing;
> 
> Yes, but still now we haven't dive into this problem. Because
> currently newstore only allow one object one file.
> 
> Anyway I guess GC isn't a big problem. journal keys should be help, is
> something I missed here?

Currently we are sloppy.  We could add an additional key on each commit to 
make sure that we don't leak fragments on crash, but honestly I'm not sure 
it's worth the effort given that it's such a tiny amount of space.
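
One possible reading of the "additional key" idea, sketched in Python
under stated assumptions: the IntentStore class, the "pending/" and
"object/" key prefixes, and the dict standing in for the key/value
database are all hypothetical, and newstore may do this differently or
not at all.

```python
import os
import tempfile
import uuid


class IntentStore:
    """Toy model of tracking not-yet-committed files with an extra key."""

    def __init__(self, root):
        self.root = root
        self.kv = {}  # stand-in for the key/value database

    def begin_write(self, data):
        fname = uuid.uuid4().hex
        # Extra key, recorded before the file exists on disk.
        self.kv["pending/" + fname] = b""
        with open(os.path.join(self.root, fname), "wb") as f:
            f.write(data)
        return fname

    def commit(self, name, fname):
        # In a real key/value store these two updates would be one
        # atomic batch.
        self.kv["object/" + name] = fname
        del self.kv["pending/" + fname]

    def recover(self):
        # After a crash, any surviving pending key names a file that
        # was never committed, so it can be removed safely.
        removed = []
        for key in [k for k in self.kv if k.startswith("pending/")]:
            fname = key[len("pending/"):]
            path = os.path.join(self.root, fname)
            if os.path.exists(path):
                os.remove(path)
            del self.kv[key]
            removed.append(fname)
        return removed


with tempfile.TemporaryDirectory() as root:
    store = IntentStore(root)
    done = store.begin_write(b"committed data")
    store.commit("obj", done)
    store.begin_write(b"never committed")  # simulate a crash here
    print(len(store.recover()), len(os.listdir(root)))  # 1 1
```

The cost is one more key write per commit, which is the tradeoff being
weighed above against the small amount of space a leaked file wastes.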

sage
