OSD Sessions at CDS


* OSD Sessions at CDS
@ 2015-03-05 17:23 Samuel Just
  2015-03-06 14:07 ` GuangYang
From: Samuel Just @ 2015-03-05 17:23 UTC
  To: ceph-devel

There were several OSD sessions at CDS on Wednesday.  I'll try to
summarize some of the key points.

======================EC Pool Overwrite Support=======================
https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_erasure_coding_pool_overwrite_support

One takeaway from the discussion was that the no-overwrite option for
RBD and cephfs may not be feasible: it's not clear that 4MB objects
make sense for an EC pool, and with cephfs we need to be able to
handle the case where the file is in shared mode.  We'd therefore
probably want to use a two-phase commit (2PC) approach, but we'd want
much more feedback on use cases before implementing it ourselves.

========================Scrub and Repair==============================

https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Scrub_and_Repair

http://pad.ceph.com/p/I-osd-scrub

The discussion focused mainly on a more detailed description of the
scrub state kept by the OSD during peering.  See the etherpad for
details.

=======================Less Intrusive Scrub===========================

https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Less_intrusive_scrub

http://pad.ceph.com/p/I-osd-less-intrusive-scrub

Some additional things we can do to reduce the scrub impact came up
and can be found in the above etherpad.

=================Faster Peering/Lower Tail Latency====================

https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Faster_Peering

https://wiki.ceph.com/Planning/Blueprints/Infernalis/Improve_tail_latency

http://pad.ceph.com/p/I-faster-peering_tailing

In addition to what is in the blueprint, Sage suggested that the primary
in some cases can keep the peer_info and peer_missing sets which it
already has if the acting set stays the same or shrinks.
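As a rough illustration of the condition above (not Ceph code), the
check "the acting set stays the same or shrinks" is a subset test.  The
function names and the dict-based peer state below are hypothetical, and
the sketch assumes the primary itself is unchanged; real peering would
have further conditions, which is why this only applies "in some cases".

```python
def can_reuse_peer_state(old_acting, new_acting):
    """True if the new acting set equals the old one or is a subset of it."""
    return set(new_acting) <= set(old_acting)


def carry_over_peer_state(old_acting, new_acting, peer_info, peer_missing):
    """Keep the cached entries for OSDs still in the acting set.

    peer_info and peer_missing are modeled as dicts keyed by OSD id.
    If a new OSD joined, nothing is reused and both come back empty.
    """
    if not can_reuse_peer_state(old_acting, new_acting):
        return {}, {}
    keep = set(new_acting)
    kept_info = {osd: v for osd, v in peer_info.items() if osd in keep}
    kept_missing = {osd: v for osd, v in peer_missing.items() if osd in keep}
    return kept_info, kept_missing
```

If the acting set gains a member, the sketch falls back to discarding
everything, which corresponds to querying the peers again.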

We also touched on prepopulating pg_temp at the monitor: in the map
which marks an osd back up, the monitor would set a different temp
primary for the pg, so that the returning osd does not become primary
immediately (and have to block reads and writes on recovery).
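A toy model of the temp primary idea, assuming a replicated pool where
the first entry of the acting list is the primary.  The function name
is hypothetical and this is not how the monitor is implemented; it only
shows the reordering being discussed.

```python
def choose_temp_acting(acting, returning_osd):
    """Return an acting order in which returning_osd is not first, if possible.

    acting is a list of OSD ids, with acting[0] treated as the primary.
    """
    if not acting or acting[0] != returning_osd:
        # The returning OSD would not be primary anyway; nothing to override.
        return list(acting)
    others = [osd for osd in acting if osd != returning_osd]
    if not others:
        # No other OSD can take over, so the mapping is left alone.
        return list(acting)
    # Put an up-to-date OSD first and move the returning OSD to the end.
    return others + [returning_osd]
```

For example, choose_temp_acting([3, 1, 2], 3) gives [1, 2, 3], so osd 1
would serve as primary while osd 3 recovers.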

In the ungraceful shutdown case, we could have a watchdog process
(systemd or something else) mark down the specific osd instance which
stopped (ceph osd down-instance <entity_inst_t>).

For EC pools, the consensus seemed to be that the best way to reduce
read latencies is to implement client side reads.

========================Tiering II (Warm->Cold)========================

https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Tiering_II_(Warm-%3ECold)

https://wiki.ceph.com/Planning/Blueprints/Infernalis/Dynamic_data_relocation_for_cache_tiering

http://pad.ceph.com/p/I-tiering

Sage and I spent some time comparing the approach above to the approach
from the firefly CDS below.  It's still not clear whether we might want
to do the firefly variant (with the client able to send IO directly to
the cold tier) in addition to the one above (where the cold tier may not
even be a rados pool).

https://wiki.ceph.com/Planning/Blueprints/%3CSIDEBOARD%3E/osd%3A_tiering%3A_object_redirects

From the discussion, it seemed like it might make sense to expand the
interface somewhat to allow the osd to proxy partial overwrites if the
backend supports it.

The consensus seemed to be that a rados-level pin operation to force an
object to stay in the hot tier would be a good idea.

-Sam


* RE: OSD Sessions at CDS
  2015-03-05 17:23 OSD Sessions at CDS Samuel Just
@ 2015-03-06 14:07 ` GuangYang
From: GuangYang @ 2015-03-06 14:07 UTC
  To: Samuel Just, ceph-devel@vger.kernel.org

> =================Faster Peering/Lower Tail Latency====================
>
> https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Faster_Peering
>
> https://wiki.ceph.com/Planning/Blueprints/Infernalis/Improve_tail_latency
>
> http://pad.ceph.com/p/I-faster-peering_tailing
>
> In addition to what is in the blueprint, Sage suggested that the primary
> in some cases can keep the peer_info and peer_missing sets which it
> already has if the acting set stays the same or shrinks.
>
> We also touched on prepopulating pg_temp at the monitor and setting a
> different temp pg primary at the monitor in the map which marks an osd
> back up to avoid that pg being primary immediately (and having to block
> reads and writes on recovery).
>
Hi Sam,
In our experience, peering is more painful when one or more OSDs stay
down (but still in) for a while and then come back up, for example when
an OSD or an OSD host crashes without notice, or when the hardware takes
time to repair.  Once the OSD is up again, PG::recovery_map has to be
populated.  Say there are N objects missing and M replicas: the
complexity of the search for missing is currently N*M*logN.  When N is
large (OSD down for a while), M is large (EC pool), and many PGs are
going through this process, the cost is non-trivial.  Tracker #9558 has
some logs with more details.

I am thinking of a simple optimization: detect the case where only one
replica (in the actingbackfill set) has missing objects and all others
are complete.  In that case we can populate the recovery_map by simply
specifying the (M - 1) replicas that have nothing missing as recovery
sources, which would improve the complexity to N*logN.
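To make the proposed fast path concrete, here is a minimal sketch in
Python rather than in the OSD's C++.  The function name and the
dict-of-sets model of per-replica missing sets are assumptions for
illustration only.  Python sets are hash based, so the logN factor from
the mail does not show up here; the point is that the fast path avoids
the per-object loop over all M replicas.

```python
def build_recovery_sources(missing_by_osd):
    """Map each missing object to the OSDs that could serve as its source.

    missing_by_osd: dict of OSD id -> set of object names missing there.
    """
    with_missing = [osd for osd, objs in missing_by_osd.items() if objs]

    if len(with_missing) == 1:
        # Fast path: one replica has missing objects, all others are
        # complete, so every object shares the same (M - 1) sources.
        behind = with_missing[0]
        complete = tuple(sorted(o for o in missing_by_osd if o != behind))
        return {obj: complete for obj in missing_by_osd[behind]}

    # General path: for each missing object, check every replica.
    all_missing = set().union(*missing_by_osd.values())
    return {
        obj: tuple(sorted(o for o, objs in missing_by_osd.items()
                          if obj not in objs))
        for obj in all_missing
    }
```

With missing_by_osd = {0: set(), 1: {"a", "b"}, 2: set()}, both "a" and
"b" map to the sources (0, 2) without any per-replica lookups.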

Does that make sense? If it does, I will go ahead and provide a patch.

Thanks,
Guang
