incorrect object stat sum in PG info after pg split

* incorrect object stat sum in PG info after pg split
@ 2017-01-10 10:03 caifeng.zhu
  2017-01-10 12:44 ` Sage Weil
  0 siblings, 1 reply; 3+ messages in thread
From: caifeng.zhu @ 2017-01-10 10:03 UTC (permalink / raw)
  To: ceph-devel

Hi, all

We find that after the number of PGs in a pool is increased, the object stat
sum in the PG info is incorrect.

The following steps reproduce the problem.
0 Assume the object store is FileStore.
1 Create a pool 'foo' with some number of PGs, e.g. 64.
2 Write data into the pool 'foo' through clients (rbd, cephfs or rgw).
3 Increase the number of PGs in the pool 'foo', e.g. to 128.
4 After the PGs have settled, use 'ceph pg x.y query' and look at the field
  'num_objects'.
5 Find the OSD where PG x.y resides with 'ceph pg map x.y' and
  count the number of objects stored there with a command like
  'find /var/lib/ceph/osd/ceph-0/current/x.y_head/ -type f | wc -l'.
  The two numbers do not match.
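
Steps 4 and 5 can be scripted. The following is only a rough sketch, not a
tested tool: it assumes the output of 'ceph pg x.y query' has been saved as
JSON text and that the FileStore collection directory is readable locally.
The helper names (find_key, count_files, compare) are made up for this
illustration. Rather than assuming the exact JSON path, it searches the
document for the first 'num_objects' key.

```python
import json
import os


def find_key(node, key):
    """Return the first value stored under `key` anywhere in parsed JSON."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def count_files(collection_dir):
    """Count regular files under a directory, like `find DIR -type f | wc -l`."""
    total = 0
    for root, _dirs, names in os.walk(collection_dir):
        for name in names:
            path = os.path.join(root, name)
            if os.path.isfile(path) and not os.path.islink(path):
                total += 1
    return total


def compare(query_json_text, collection_dir):
    """Return (num_objects reported by the PG, files found on disk)."""
    reported = find_key(json.loads(query_json_text), "num_objects")
    return reported, count_files(collection_dir)
```

For example, after 'ceph pg x.y query > query.json', calling
compare(open("query.json").read(), "/var/lib/ceph/osd/ceph-0/current/x.y_head/")
returns both numbers side by side. The file count may include files that are
not client objects, so small differences alone do not prove anything.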

The code flow to increase the pg number is as follows:
OSD::advance_pg
	-> OSD::split_pgs
		-> object_stat_sum::split
	-> ReplicatedPG::split_colls
		-> PG::_create
		-> ObjectStore::Transaction::split_collection
			/* indirectly calls FileStore::_split_collection
			 * when the transaction is applied to the file system.
			 */
	-> PG::split_into

Comparing object_stat_sum::split with FileStore::_split_collection, the two use
different splitting logic, so stat.sum ends up different from the actual number
of objects in the collection.
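
A toy model of the mismatch, as a sketch only. It assumes that
object_stat_sum::split divides each counter evenly among the children, and
that the collection split moves an object according to the newly significant
bit of its hash. SHA-1 stands in for the real object hash here, and the
function names are hypothetical.

```python
import hashlib


def split_stat_evenly(value, n_children):
    """Divide one counter evenly, giving the remainder to the first children."""
    base, extra = divmod(value, n_children)
    return [base + (1 if i < extra else 0) for i in range(n_children)]


def split_by_hash(names, old_pg_num, parent_seed):
    """Count objects of one parent PG that stay or move when pg_num doubles.

    old_pg_num is assumed to be a power of two.
    """
    stay, move = 0, 0
    for name in names:
        digest = hashlib.sha1(name.encode()).digest()
        h = int.from_bytes(digest[:4], "little")
        if h % old_pg_num != parent_seed:
            continue  # object never belonged to this parent PG
        if h % (2 * old_pg_num) == parent_seed:
            stay += 1  # new hash bit is 0: object stays in the parent
        else:
            move += 1  # new hash bit is 1: object moves to the child
    return stay, move


names = ["obj%d" % i for i in range(10000)]
stay, move = split_by_hash(names, 64, 0)
print("stat split:", split_stat_evenly(stay + move, 2))
print("hash split:", [stay, move])
```

The stat split always yields two nearly equal halves, while the hash split
depends on the actual object names, so the two generally disagree.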

The question is: should we fix this difference? If so, how?
With the current design, it seems very difficult to fix.

A similar bug is reported as tracker.ceph.com/issues/16671, which occurs
if all the existing data in pool 'foo' is deleted.

Best Regards

* Re: incorrect object stat sum in PG info after pg split
  2017-01-10 10:03 incorrect object stat sum in PG info after pg split caifeng.zhu
@ 2017-01-10 12:44 ` Sage Weil
  2017-01-11  8:08   ` caifeng.zhu
  0 siblings, 1 reply; 3+ messages in thread
From: Sage Weil @ 2017-01-10 12:44 UTC (permalink / raw)
  To: caifeng.zhu; +Cc: ceph-devel

On Tue, 10 Jan 2017, caifeng.zhu@uniswdc.com wrote:
> Hi, all
> 
> We find that after the number of pgs increased, the object stat sum
> in pg info is incorrect. 
> 
> The following steps can reproduce the problem.
> 0 assume the object store is a filestore.
> 1 create a pool 'foo' with the number of pgs such as 64.
> 2 write data through clients(rbd, cephfs or rgw) into the pool 'foo'.
> 3 increase the number of pgs in the pool 'foo' to such as 128.
> 4 after pgs are settled, use 'ceph pg x.y query' to look at the field
>   'num_objects'
> 5 find the osd shard where pg x.y resides by 'ceph pg map x.y' and
>   count the number of objects in the osd shard by command like 
>   'find /var/lib/ceph/osd/ceph-0/current/x.y_head/ -type f | wc -l'
> 
> The code flow to increase the pg number is as follows:
> OSD::advance_pg
> 	-> OSD::split_pgs
> 		-> object_stat_sum::split
> 	-> ReplicatedPG::split_colls
> 		-> PG::_create
> 		-> ObjectStore::Transaction::split_collection
> 			/* indirectly call FileStore::_split_collection 
> 			 * when applying transaction into file system.
> 			 */
> 	-> PG::split_into
> 
> Compare object_stat_sum::split with FileStore::_split_collection, the splitting
> logic is different and makes stat.sum different from the actual number of objects
> in the collection. 
> 
> The question is that should we fix this difference? If so, how to fix? 
> In current design, it seems very difficult to fix the problem.

Right, it's expected to be out of sync.  The pg_stats structure has a bool
flag indicating that the stats are not strictly accurate (only an
approximation); they will be corrected during the next scrub.  You can
force this to happen explicitly on a test pg with 'ceph pg scrub <pgid>'
and then verify that the stats are accurate afterwards.  You can also see
the full stats structure (including the flag) with 'ceph pg dump -f
json-pretty'.
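
To pick the flagged PGs out of that dump, something like the sketch below
may help. It assumes the flag appears in the JSON as a field named
'stats_invalid' next to each 'pgid'; treat that name as an assumption and
check it against your own output. The layout of the dump differs between
releases, so the code walks the whole document instead of relying on a
fixed path. The function names are made up for this example.

```python
import json


def is_set(flag):
    """Accept a JSON bool, a number, or a string such as "1" or "true"."""
    if isinstance(flag, str):
        return flag.strip().lower() in ("1", "true")
    return bool(flag)


def pgs_with_invalid_stats(node, flag_name="stats_invalid"):
    """Collect the pgid of every PG entry whose flag is set."""
    found = []
    if isinstance(node, dict):
        if "pgid" in node and is_set(node.get(flag_name, False)):
            found.append(node["pgid"])
        for child in node.values():
            found.extend(pgs_with_invalid_stats(child, flag_name))
    elif isinstance(node, list):
        for child in node:
            found.extend(pgs_with_invalid_stats(child, flag_name))
    return found


def pgs_with_invalid_stats_from_text(dump_text):
    """Same as above, starting from the raw JSON text of the dump."""
    return pgs_with_invalid_stats(json.loads(dump_text))
```

After a scrub of a listed PG completes, rerunning this on a fresh dump
should no longer report that PG.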

It would be very hard to make the ObjectStore backend (FileStore or
BlueStore) able to split a collection in O(1) time *and* provide an
accurate split of the stats (and their many fields) as well.  It is also
not that important; the approximation is sufficient for most purposes.  The
only consumer it's not good enough for is the cache tiering agent, which is
disabled until the next scrub happens on the PG.

sage

> 
> A similar bug is reported as tracker.ceph.com/issues/16671, which occurs
> if all the existing data in pool 'foo' is deleted.
> 
> Best Regards
> 

* Re: incorrect object stat sum in PG info after pg split
  2017-01-10 12:44 ` Sage Weil
@ 2017-01-11  8:08   ` caifeng.zhu
  0 siblings, 0 replies; 3+ messages in thread
From: caifeng.zhu @ 2017-01-11  8:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: caifeng.zhu, ceph-devel

Hi, Sage

Thanks for your suggestion. It works for us.

Best Regards

