Duplication of dirent names in JFFS2 summary

public inbox for linux-mtd@lists.infradead.org
 help / color / mirror / Atom feed

* Duplication of dirent names in JFFS2 summary
@ 2006-05-19  0:44 David Woodhouse
  2006-05-19  1:01 ` David Woodhouse
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: David Woodhouse @ 2006-05-19  0:44 UTC (permalink / raw)
  To: Ferenc Havasi; +Cc: linux-mtd

It's a little unfortunate that we have to actually keep a second copy of
the name in the summary. Is there any way we could avoid it?

One possibility might be to refrain from storing the name in the dirent
node itself -- store it _only_ in the summary. This could happen only
during garbage collection -- when we GC a dirent node, we could write it
out _without_ its name. In fact, we could possibly refrain from writing
the GC'd dirent to the log at all -- put it _only_ in the summary.

That does break backwards compatibility though, which isn't ideal.

Another possibility might be to omit the full name from the summary, but
to store only the hash (or name_crc) instead. We hardly ever actually
need the full name -- it's only used if there's a hash collision, in
jffs2_add_fd_to_list().

Further thoughts...? 

Do we have any current data on how much space is taken by summary nodes
on typical file systems, and how much of that is names?

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19  0:44 Duplication of dirent names in JFFS2 summary David Woodhouse
@ 2006-05-19  1:01 ` David Woodhouse
  2006-05-19 11:53   ` David Woodhouse
  2006-05-19  6:05 ` Jörn Engel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: David Woodhouse @ 2006-05-19  1:01 UTC (permalink / raw)
  To: Ferenc Havasi; +Cc: linux-mtd

While I'm looking at how to reduce the size of the summary... would we
also gain by dropping the 'totlen' field and assuming that each node's
length can be found by subtracting its offset from the offset on the
next node on the flash? (Much as I plan to do to reduce memory usage by
shrinking the struct jffs2_raw_node_ref.)

That'd mean we have to have a summary representation of 'dirty' space,
but there should be relatively few of those in the eraseblocks we've
only just finished writing out, so it should be a gain overall.

We don't actually need 'type' in the jffs2_sum_dirent_flash either, do
we? Depending on the number of crc32 collisions we get and how expensive
it is to read the names from the flash, we might not even need the nsize
field.

And would we benefit by using a variable-length encoding of the integers
in the summary entries? We're far from being CPU-bound when we eat
these, surely?

... which in turn means we should keep storing totlen and drop _offset_,
since we can infer one from the other and the length can always be
stored in fewer bits...

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19  1:01 ` David Woodhouse
@ 2006-05-19 11:53   ` David Woodhouse
  2006-05-19 11:58     ` Artem B. Bityutskiy
  0 siblings, 1 reply; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 11:53 UTC (permalink / raw)
  To: Ferenc Havasi; +Cc: linux-mtd

On Fri, 2006-05-19 at 02:01 +0100, David Woodhouse wrote:
> And would we benefit by using a variable-length encoding of the integers
> in the summary entries? We're far from being CPU-bound when we eat
> these, surely?

Thinking of something vaguely like UTF-8, but more optimal. Each byte
would convey seven bits of information, and the top bit would indicate
whether there's a further byte to be added.

So a number from 0-127 would be stored as-is, while numbers from
128-16383 are stored with seven bits in each of two bytes, with the top
bit of the first set and the top bit of the second clear.

This means we take only two bytes for the common node lengths -- one
byte for nodes up to 508 bytes (obviously we can shift the length >>2
before storing it since it's always a multiple of 4). 

Or should we be more ambitious and actually go for bit-packing instead
of just bytes?

unsigned char *encode_int(unsigned char *p, uint32_t val)
{
	switch (val) {
	case (1<<28) ... 0xffffffff:
		*p++ = 0x80 | ((val >> 28) & 0x7f);
	case (1<<21) ... (1<<28)-1:
		*p++ = 0x80 | ((val >> 21) & 0x7f);
	case (1<<14) ... (1<<21)-1:
		*p++ = 0x80 | ((val >> 14) & 0x7f);
	case (1<<7) ... (1<<14)-1:
		*p++ = 0x80 | ((val >> 7) & 0x7f);
	default:
		*p++ = val & 0x7f;
	}
	return p;
}

unsigned char *decode_int(unsigned char *p, uint32_t *val)
{
	uint32_t ret = 0;
	unsigned char c;

	while ((c = *p++) & 0x80) {
		ret |= c & 0x7f;
		ret <<= 7;
	}
	ret |= c;
	*val = ret;
	return p;
}

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 11:53   ` David Woodhouse
@ 2006-05-19 11:58     ` Artem B. Bityutskiy
  2006-05-19 12:11       ` David Woodhouse
  0 siblings, 1 reply; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 11:58 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 12:53 +0100, David Woodhouse wrote:
> On Fri, 2006-05-19 at 02:01 +0100, David Woodhouse wrote:
> > And would we benefit by using a variable-length encoding of the integers
> > in the summary entries? We're far from being CPU-bound when we eat
> > these, surely?
> 
> Thinking of something vaguely like UTF-8, but more optimal. Each byte
> would convey seven bits of information, and the top bit would indicate
> whether there's a further byte to be added.
Why do you think it really worth it ?

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 11:58     ` Artem B. Bityutskiy
@ 2006-05-19 12:11       ` David Woodhouse
  2006-05-19 12:14         ` Artem B. Bityutskiy
  0 siblings, 1 reply; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 12:11 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 15:58 +0400, Artem B. Bityutskiy wrote:
> > Thinking of something vaguely like UTF-8, but more optimal. Each byte
> > would convey seven bits of information, and the top bit would indicate
> > whether there's a further byte to be added.
>
> Why do you think it really worth it ?

If we can halve the amount of space taken on the flash by the summary
nodes, I think that's a worthwhile aim.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 12:11       ` David Woodhouse
@ 2006-05-19 12:14         ` Artem B. Bityutskiy
  2006-05-19 12:31           ` David Woodhouse
  0 siblings, 1 reply; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 12:14 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 13:11 +0100, David Woodhouse wrote:
> If we can halve the amount of space taken on the flash by the summary
> nodes, I think that's a worthwhile aim.
> 
If halve, OK, but why do you think this encoding will halve it?

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 12:14         ` Artem B. Bityutskiy
@ 2006-05-19 12:31           ` David Woodhouse
  2006-05-19 14:34             ` Artem B. Bityutskiy
  0 siblings, 1 reply; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 12:31 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 16:14 +0400, Artem B. Bityutskiy wrote:
> On Fri, 2006-05-19 at 13:11 +0100, David Woodhouse wrote:
> > If we can halve the amount of space taken on the flash by the summary
> > nodes, I think that's a worthwhile aim.
> > 
> If halve, OK, but why do you think this encoding will halve it? 

I think overall we can about halve the size of the summary nodes. A
variable-length encoding would be part of that.

struct jffs2_sum_inode_flash
{
	jint16_t nodetype;	/* node type */
	jint32_t inode;		/* inode number */
	jint32_t version;	/* inode version */
	jint32_t offset;	/* offset on jeb */
	jint32_t totlen; 	/* record length */
} __attribute__((packed));

The nodetype can be a single byte. 

Inode number is going to start low, and will take only two bytes if it's
less than 16364 -- three bytes up to 2097152. Storing it as an index
into a table of 'inodes affected by this summary' might also be
worthwhile, since we'll often have many nodes which belong to the same
inode.

Version is also going to start low -- and we can also avoid storing it
for the second and subsequent nodes belonging to any given inode. We
_know_ it only counts up by one at a time.

Offset can be entirely redundant if we represent dirty space with a
separate summary entry -- there shouldn't be much dirty space in the
eraseblock when we've only just finished writing it anyway.

Totlen is going to be small. 

So yes, for the 'struct jffs2_sum_inode_flash' I think we can halve it.

Much of the same goes for the 'struct jffs2_sum_dirent_flash', and we
can also drop the name from that too. So yes, I suspect we can get close
to half the size of that too.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 12:31           ` David Woodhouse
@ 2006-05-19 14:34             ` Artem B. Bityutskiy
  2006-05-19 14:59               ` David Woodhouse
  0 siblings, 1 reply; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 14:34 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 13:31 +0100, David Woodhouse wrote:
> Inode number is going to start low, and will take only two bytes if it's
> less than 16364 -- three bytes up to 2097152. Storing it as an index
> into a table of 'inodes affected by this summary' might also be
> worthwhile, since we'll often have many nodes which belong to the same
> inode.
> 
> Version is also going to start low -- and we can also avoid storing it
> for the second and subsequent nodes belonging to any given inode. We
> _know_ it only counts up by one at a time.
I'd not take seriously a compression which stably degrades with the flow
of time. So, I'd no count version and inode. And I have a feeling that
the gain will not worth the effort, new bugs, new complexities...

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 14:34             ` Artem B. Bityutskiy
@ 2006-05-19 14:59               ` David Woodhouse
  2006-05-19 16:41                 ` David Woodhouse
  0 siblings, 1 reply; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 14:59 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 18:34 +0400, Artem B. Bityutskiy wrote:
> I'd not take seriously a compression which stably degrades with the
> flow of time. So, I'd no count version and inode. And I have a feeling
> that the gain will not worth the effort, new bugs, new complexities...

Do we have current figures on how much space the summary nodes actually
take?

What I'd actually do, rather than storing inode number and version just
like that, is have a table stored in the summary, containing tuples of
inode number, and the first version number of that inode which is used.
The actual node entries in the summary would then just contain an index
into that table for the inode number, and an offset for the version.

So if we have an eraseblock which contains the following nodes...

	{ ino# 143454, version# 2341 }
	{ ino# 143454, version# 2342 }
	{ ino# 143454, version# 2343 }
	{ ino# 209343, version# 342 }
	{ ino# 143454, version# 2344 }
	{ ino# 209343, version# 343 }

... that would be represented as a table containing
[0]:	{ ino #143454, first version# 2341 }
[1]:	{ ino #209343, first version# 342 }

... followed by actual summary entries which just refer to that table,
having { index into inode table, offset from first version } tuples...

	{ 0, 0 }
	{ 0, 1 }
	{ 0, 2 }
	{ 1, 0 }
	{ 0, 3 }
	{ 1, 1 }

And in that case, the numbers _would_ compress reliably, even on old
file systems. 

I wonder if using something like rtime would work for compressing
summary blocks, rather than a specific integer encoding...

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 14:59               ` David Woodhouse
@ 2006-05-19 16:41                 ` David Woodhouse
  2006-05-19 16:43                   ` David Woodhouse
  0 siblings, 1 reply; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 16:41 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 15:59 +0100, David Woodhouse wrote:
> Do we have current figures on how much space the summary nodes
> actually take?

On the OLPC board -- 512MiB NAND in 4096 * 128KiB eraseblocks.

I have it about 22% full, with copies of /lib, /bin, /etc, /boot
and /usr/bin from a Fedora system -- along with a few nodes in /dev
because I was testing the new device node support.

795 eraseblocks have a summary.

The average size of the summary is 2458 bytes -- about ~1.8% of the
eraseblock size. The minimum was 616, maximum 24608 bytes.

The average size of the _names_ in the summary is 92 bytes, which is
~0.07% of the eraseblock size. Minimum 0, Maximum 3070.

Total space taken by summaries if all 4096 blocks of the file system
were contained summaries at these average sizes would be 9.6 MiB, of
which 370KiB or so would be names.

So we might not actually gain _much_ by removing the duplicate names --
they really aren't a significant contribution to the size of the summary
nodes. It isn't _hard_ to remove them though, and it may well make a lot
of sense not to read them even in the !SUMMARY case too.

If we can remove the 'offset' field from the summary entries and
compress the numbers either by an encoding such as the one I was talking
about and an inode table, or by just using a general-purpose compression
on the summary node before writing it, then I think we probably ought to
get the overhead down to 1% or so from the current 1.8% -- that's
probably worth doing.

Just using rubin compression on the summary nodes would save 35% (thus
they'd be 1.25% of total size) -- by reducing the average to 1708 bytes
per eraseblock.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 16:41                 ` David Woodhouse
@ 2006-05-19 16:43                   ` David Woodhouse
  0 siblings, 0 replies; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 16:43 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 17:41 +0100, David Woodhouse wrote:
> Just using rubin compression on the summary nodes would save 35% (thus
> they'd be 1.25% of total size) -- by reducing the average to 1708
> bytes per eraseblock. 

Sorry, that should be 30%. 1708 is 69.5% of 2457.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19  0:44 Duplication of dirent names in JFFS2 summary David Woodhouse
  2006-05-19  1:01 ` David Woodhouse
@ 2006-05-19  6:05 ` Jörn Engel
  2006-05-19 10:08   ` David Woodhouse
  2006-05-19 11:32 ` Artem B. Bityutskiy
  2006-05-19 11:57 ` Artem B. Bityutskiy
  3 siblings, 1 reply; 28+ messages in thread
From: Jörn Engel @ 2006-05-19  6:05 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 19 May 2006 01:44:24 +0100, David Woodhouse wrote:
> 
> It's a little unfortunate that we have to actually keep a second copy of
> the name in the summary. Is there any way we could avoid it?
> 
> One possibility might be to refrain from storing the name in the dirent
> node itself -- store it _only_ in the summary. This could happen only
> during garbage collection -- when we GC a dirent node, we could write it
> out _without_ its name. In fact, we could possibly refrain from writing
> the GC'd dirent to the log at all -- put it _only_ in the summary.
> 
> That does break backwards compatibility though, which isn't ideal.

Worse, what happens if the power fails?  Do we block creat/mkdir/mknod
until summary is written?  Could be a long time.  Or do we return to
userspace and claim we've written everything while in fact we haven't?

I'd pick the other option.

Jörn

-- 
The story so far:
In the beginning the Universe was created.  This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19  6:05 ` Jörn Engel
@ 2006-05-19 10:08   ` David Woodhouse
  0 siblings, 0 replies; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 10:08 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd

On Fri, 2006-05-19 at 08:05 +0200, Jörn Engel wrote:
> Worse, what happens if the power fails?  Do we block creat/mkdir/mknod
> until summary is written?  Could be a long time.  Or do we return to
> userspace and claim we've written everything while in fact we haven't?

That's why I said it could happen only during GC.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19  0:44 Duplication of dirent names in JFFS2 summary David Woodhouse
  2006-05-19  1:01 ` David Woodhouse
  2006-05-19  6:05 ` Jörn Engel
@ 2006-05-19 11:32 ` Artem B. Bityutskiy
  2006-05-19 11:57 ` Artem B. Bityutskiy
  3 siblings, 0 replies; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 11:32 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

David Woodhouse wrote:
> It's a little unfortunate that we have to actually keep a second copy of
> the name in the summary. Is there any way we could avoid it?
> 
> One possibility might be to refrain from storing the name in the dirent
> node itself -- store it _only_ in the summary. This could happen only
> during garbage collection -- when we GC a dirent node, we could write it
> out _without_ its name. In fact, we could possibly refrain from writing
> the GC'd dirent to the log at all -- put it _only_ in the summary.
> 
> That does break backwards compatibility though, which isn't ideal.
And it's just bad:

o it's hacky;
o it'll introduce many if()s in code and make it even more copmplecated;
o finally, it'll interfer robustness: if the summary node get's 
corrupted, we may loose a lot - not just one node;

I believe that this approach has to be vetoed.

> Another possibility might be to omit the full name from the summary, but
> to store only the hash (or name_crc) instead. We hardly ever actually
> need the full name -- it's only used if there's a hash collision, in
> jffs2_add_fd_to_list().
Good idea.

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19  0:44 Duplication of dirent names in JFFS2 summary David Woodhouse
                   ` (2 preceding siblings ...)
  2006-05-19 11:32 ` Artem B. Bityutskiy
@ 2006-05-19 11:57 ` Artem B. Bityutskiy
  2006-05-19 12:05   ` Artem B. Bityutskiy
  3 siblings, 1 reply; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 11:57 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 01:44 +0100, David Woodhouse wrote:
> Another possibility might be to omit the full name from the summary, but
> to store only the hash (or name_crc) instead. We hardly ever actually
> need the full name -- it's only used if there's a hash collision, in
> jffs2_add_fd_to_list().

About reading names during scanning.

The only reason why we read direntry names during scanning is to find
out which directory entries were deleted. There is no other reason to
read names.

Let's think if we may avoid reading direntry names at all. Yes, it is
possible I think - we just have to postpone this for later. But I don't
see a simple way do do this. Ideas?

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 11:57 ` Artem B. Bityutskiy
@ 2006-05-19 12:05   ` Artem B. Bityutskiy
  2006-05-19 12:23     ` David Woodhouse
  0 siblings, 1 reply; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 12:05 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 15:57 +0400, Artem B. Bityutskiy wrote:
> About reading names during scanning.
> 
> The only reason why we read direntry names during scanning is to find
> out which directory entries were deleted. There is no other reason to
> read names.
> 
> Let's think if we may avoid reading direntry names at all. Yes, it is
> possible I think - we just have to postpone this for later. But I don't
> see a simple way do do this. Ideas?
> 
I mean, we don't have to know the right inode refcount values at the
scanning time, may we calculate it at the readinode() time?

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 12:05   ` Artem B. Bityutskiy
@ 2006-05-19 12:23     ` David Woodhouse
  2006-05-19 14:17       ` Artem B. Bityutskiy
  0 siblings, 1 reply; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 12:23 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 16:05 +0400, Artem B. Bityutskiy wrote:
> I mean, we don't have to know the right inode refcount values at the
> scanning time, may we calculate it at the readinode() time?

Stop. Look at where we use the _name_. Hint: It's only in
jffs2_add_fd_to_list(). Observe that we almost never need it, and we can
actually fetch it from the flash only when we have to.

Ponder whether that's going to be an improvement overall. It means the
summary shrinks, so takes less space and is faster to read because it
doesn't contain any names. The down side is that _sometimes_ when
there's a crc32 collision, you have to actually read the name for some
nodes from the flash after reading the summary.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 12:23     ` David Woodhouse
@ 2006-05-19 14:17       ` Artem B. Bityutskiy
  2006-05-19 14:50         ` David Woodhouse
  0 siblings, 1 reply; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 14:17 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 13:23 +0100, David Woodhouse wrote:
> On Fri, 2006-05-19 at 16:05 +0400, Artem B. Bityutskiy wrote:
> > I mean, we don't have to know the right inode refcount values at the
> > scanning time, may we calculate it at the readinode() time?
> 
> Stop. Look at where we use the _name_. Hint: It's only in
> jffs2_add_fd_to_list(). 
True.

> Observe that we almost never need it, and we can
> actually fetch it from the flash only when we have to.
Let's talk not in 'summary' context. Suppose there is no summary. I'm
talking about how to avoid reading names (and CRC-checking) in this
case. 

WIn the current implementation, we have to read names. At the source
code-level language, because we need nhash at jffs2_add_fd_to_list().

But I'll repeat my statement. The only fundamental reason why we need
names at the scanning time is to provide correct nlink of inodes. This
is because the way how JFFS2 unlinks: it writes direntry with the same
name but with target ino=0.

So, if we'd not read names at the scanning time at all, we'd have no
possibility to calculate correct nlinks. They'd be greater or equivalent
to the correct values. And some dead inodes would no go at scan time.
And my idea: so what? Nothing really bad in this. We'll adjust them
later.

> Ponder whether that's going to be an improvement overall. It means the
> summary shrinks, so takes less space and is faster to read because it
> doesn't contain any names. The down side is that _sometimes_ when
> there's a crc32 collision, you have to actually read the name for some
> nodes from the flash after reading the summary.

So, we can skip reading names. But then we'll have incorrect nlinks
after scanning. And what I'm saying is that it is possible to refine
nlinks at readinode() time.

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 14:17       ` Artem B. Bityutskiy
@ 2006-05-19 14:50         ` David Woodhouse
  2006-05-19 15:07           ` Artem B. Bityutskiy
  0 siblings, 1 reply; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 14:50 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 18:17 +0400, Artem B. Bityutskiy wrote:
> Let's talk not in 'summary' context. Suppose there is no summary. I'm
> talking about how to avoid reading names (and CRC-checking) in this
> case. 

OK.

> In the current implementation, we have to read names. At the source
> code-level language, because we need nhash at jffs2_add_fd_to_list().

We could use the name_crc for that instead. We don't actually need the
name itself. It just has to be a hash of the name -- the Linux nhash
just happened to be a good, cheap one.

> But I'll repeat my statement. The only fundamental reason why we need
> names at the scanning time is to provide correct nlink of inodes. This
> is because the way how JFFS2 unlinks: it writes direntry with the same
> name but with target ino=0.

We still only need the name _if_ the crc32 and length match a previous
dirent in the same directory. For ~99% of dirents, we _don't_ need the
name at scan time at all.

We only actually need to look at the name in the case where there are
two or more dirents which have the same pino, crc32 (or nhash) and
length.

Let us assume that genuine hash collisions are rare enough to be
ignored, statistically. So on NOR flash without summary, where we
actually mark the old nodes obsolete, you'll almost _never_ actually
have to use the real name in jffs2_add_fd_to_list(). You might as well
not read it during the scan.

On NAND flash, or with summary, we don't mark old nodes obsolete. So
it'll happen a bit more often -- specifically, in two cases:

First, there's the case where we've garbage-collected a dirent node but
haven't _yet_ deleted the original. But deleting the original is the
whole _point_ in GC, so that's not a common case. 

Second, there's the case where we've unlinked a dirent node but the
original hasn't yet been actually erased. That may be a _little_ more
common, but I don't think it'll be _so_ common.

So I still think that the benefit we get from dropping the full name
from all the rest of the dirents in the summary (or just not reading it
at scan time, in the !SUMMARY version) is going to be _more_ than the
slowdown caused by occasionally having to go back to the flash and read
the name when there's a collision.

If the time taken by reading names later really _is_ significant, we
could perhaps reduce that further by including "obsoleted version
number" in the summary for dirent nodes.

> So, if we'd not read names at the scanning time at all, we'd have no
> possibility to calculate correct nlinks.

Not so. We could calculate correct nlinks in the majority of cases. We
really don't need the full name very often.

>  They'd be greater or equivalent to the correct values. And some dead
> inodes would no go at scan time. And my idea: so what? Nothing really
> bad in this. We'll adjust them later.

But nlink is visible to userspace. We'd be giving incorrect values to
userspace until we've finished the scan.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 14:50         ` David Woodhouse
@ 2006-05-19 15:07           ` Artem B. Bityutskiy
  2006-05-19 15:26             ` Artem B. Bityutskiy
  2006-05-19 15:37             ` David Woodhouse
  0 siblings, 2 replies; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 15:07 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 15:50 +0100, David Woodhouse wrote:
> We could use the name_crc for that instead. We don't actually need the
> name itself. It just has to be a hash of the name -- the Linux nhash
> just happened to be a good, cheap one.
Fair enough, indeed. Using CRC on scan, then nhash() later.

> We still only need the name _if_ the crc32 and length match a previous
> dirent in the same directory. For ~99% of dirents, we _don't_ need the
> name at scan time at all.
Ok, fair.

> We only actually need to look at the name in the case where there are
> two or more dirents which have the same pino, crc32 (or nhash) and
> length.
Fair.

> Let us assume that genuine hash collisions are rare enough to be
> ignored, statistically. So on NOR flash without summary, where we
> actually mark the old nodes obsolete, you'll almost _never_ actually
> have to use the real name in jffs2_add_fd_to_list(). You might as well
> not read it during the scan.
Yes.

> On NAND flash, or with summary, we don't mark old nodes obsolete. So
> it'll happen a bit more often -- specifically, in two cases:
> 
> First, there's the case where we've garbage-collected a dirent node but
> haven't _yet_ deleted the original. But deleting the original is the
> whole _point_ in GC, so that's not a common case.
Sorry, -ENOPARSE on "But deleting the original is the whole _point_ in
GC, so that's not a common case."

> Second, there's the case where we've unlinked a dirent node but the
> original hasn't yet been actually erased. That may be a _little_ more
> common, but I don't think it'll be _so_ common.
Ok.

> So I still think that the benefit we get from dropping the full name
> from all the rest of the dirents in the summary (or just not reading it
> at scan time, in the !SUMMARY version) is going to be _more_ than the
> slowdown caused by occasionally having to go back to the flash and read
> the name when there's a collision.
I assume so too.

> If the time taken by reading names later really _is_ significant, we
> could perhaps reduce that further by including "obsoleted version
> number" in the summary for dirent nodes.
Probably.

> > So, if we'd not read names at the scanning time at all, we'd have no
> > possibility to calculate correct nlinks.
> 
> Not so. We could calculate correct nlinks in the majority of cases. We
> really don't need the full name very often.
Majority is a bad word here.

> >  They'd be greater or equivalent to the correct values. And some dead
> > inodes would no go at scan time. And my idea: so what? Nothing really
> > bad in this. We'll adjust them later.
> 
> But nlink is visible to userspace. We'd be giving incorrect values to
> userspace until we've finished the scan.

1. During scan nobody can access JFFS2, so "giving incorrect values to
userspace until we've finished the scan" is wrong.
2. Before accessing *any* inode, read_inode() is called. So what I'm
saying is to calculate *correct* nlink at the read_inode() time.

So, we're still talking about different things. Probably I'm expressing
my thought horribly.

You are talking about how to use hashes instead of names. What is
common, what is not, what do do in case of collisions.

I'm talking about how to avoid looking at names at all. At hashes too.
See the difference? No collisions.

I'm repeating for the 3rd time my statement: names (or their hashes) are
only needed to calculate correct nlinks. We don't habe to calculate
correct nlinks at the *scan* time, we may do try to do this at the
*read_inode()* time. See what I mean? 

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 15:07           ` Artem B. Bityutskiy
@ 2006-05-19 15:26             ` Artem B. Bityutskiy
  2006-05-19 15:33               ` David Woodhouse
  2006-05-19 15:37             ` David Woodhouse
  1 sibling, 1 reply; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 15:26 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

It;s getting late, I'm doing too many typos. Retype.

On Fri, 2006-05-19 at 19:07 +0400, Artem B. Bityutskiy wrote:
> 1. During scan nobody can access JFFS2, so "giving incorrect values to
> userspace until we've finished the scan" is wrong.
> 2. Before accessing *any* inode, read_inode() is called. So what I'm
> saying is to calculate *correct* nlink at the read_inode() time.
> 
> So, we're still talking about different things. Probably I'm expressing
> my thought horribly.
> 
> You are talking about how to use hashes instead of names. What is
> common, what is not, what do do in case of collisions.
> 
> I'm talking about how to avoid looking at names at all. At hashes too.
> See the difference? No collisions.
> 
> I'm repeating for the 3rd time my statement: names (or their hashes) are
> only needed to calculate correct nlinks. We don't habe to calculate
> correct nlinks at the *scan* time, we may do try to do this at the
> *read_inode()* time. See what I mean? 
> 
1. During scan nobody can access JFFS2, so "giving incorrect values to
userspace until we've finished the scan" is wrong.
2. Before accessing *any* inode, read_inode() is called. So what I'm
saying is to calculate *correct* nlink at the read_inode() time.

So, we're still talking about different things. Probably I'm expressing
my thoughts horribly.

You are talking about how to use hashes instead of names. What is
common, what is not, what to do in case of collisions.

I'm talking about how to avoid looking at names at all. At hashes too.
See the difference? No collisions.

I'm repeating for the 3rd time my statement: names (or their hashes) are
only needed to calculate correct nlinks. We don't have to calculate
correct nlinks at the *scan* time, we may try to do this at the
*read_inode()* time. See what I mean? 

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 15:26             ` Artem B. Bityutskiy
@ 2006-05-19 15:33               ` David Woodhouse
  2006-05-19 15:38                 ` Artem B. Bityutskiy
  0 siblings, 1 reply; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 15:33 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 19:26 +0400, Artem B. Bityutskiy wrote:
> 1. During scan nobody can access JFFS2, so "giving incorrect values to
> userspace until we've finished the scan" is wrong.

I shouldn't have said 'scan'. I should have said 'until the GC thread
has finished checking nodes'. 

> 2. Before accessing *any* inode, read_inode() is called. So what I'm
> saying is to calculate *correct* nlink at the read_inode() time.

You can't do that. You don't have correct nlink until you've checked all
the dirents in _all_ the directories on the file system.

> I'm repeating for the 3rd time my statement: names (or their hashes) are
> only needed to calculate correct nlinks. We don't have to calculate
> correct nlinks at the *scan* time, we may try to do this at the
> *read_inode()* time. See what I mean? 

I know what you mean, and it's _still_ not correct.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 15:33               ` David Woodhouse
@ 2006-05-19 15:38                 ` Artem B. Bityutskiy
  2006-05-19 15:43                   ` David Woodhouse
  0 siblings, 1 reply; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 15:38 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 16:33 +0100, David Woodhouse wrote:
> I shouldn't have said 'scan'. I should have said 'until the GC thread
> has finished checking nodes'. 
> 
> > 2. Before accessing *any* inode, read_inode() is called. So what I'm
> > saying is to calculate *correct* nlink at the read_inode() time.
> 
> You can't do that. You don't have correct nlink until you've checked all
> the dirents in _all_ the directories on the file system.
Err... I don't understand. Suppose you've scanned your FS and mounted
it.

Then the 1st call is read_inode("/"). Here you calculate nlink of all
direntries od "/".

The 2nd, say, read_inode("/home"). Here you calculate nlink of all
direntries of "/home".

And so forth. I don't see the problem.

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 15:38                 ` Artem B. Bityutskiy
@ 2006-05-19 15:43                   ` David Woodhouse
  2006-05-19 15:46                     ` Artem B. Bityutskiy
  2006-05-20  8:38                     ` Artem B. Bityutskiy
  0 siblings, 2 replies; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 15:43 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 19:38 +0400, Artem B. Bityutskiy wrote:
> Err... I don't understand. Suppose you've scanned your FS and mounted
> it.
> 
> Then the 1st call is read_inode("/"). Here you calculate nlink of all
> direntries od "/".
> 
> The 2nd, say, read_inode("/home"). Here you calculate nlink of all
> direntries of "/home".
> 
> And so forth. I don't see the problem. 

That's because you haven't actually thought it through.

Q: What affects nlink? If there's a file 'foo' in the '/' directory,
which is inode #365 -- where do you look for dirents which might affect
the nlink of that inode #365?

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 15:43                   ` David Woodhouse
@ 2006-05-19 15:46                     ` Artem B. Bityutskiy
  2006-05-20  8:38                     ` Artem B. Bityutskiy
  1 sibling, 0 replies; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-19 15:46 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 16:43 +0100, David Woodhouse wrote:
> Q: What affects nlink? If there's a file 'foo' in the '/' directory,
> which is inode #365 -- where do you look for dirents which might affect
> the nlink of that inode #365?
> 
OK, got it. Thanks.

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 15:43                   ` David Woodhouse
  2006-05-19 15:46                     ` Artem B. Bityutskiy
@ 2006-05-20  8:38                     ` Artem B. Bityutskiy
  2006-05-20  9:08                       ` Artem B. Bityutskiy
  1 sibling, 1 reply; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-20  8:38 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Fri, 2006-05-19 at 16:43 +0100, David Woodhouse wrote:
> Q: What affects nlink? If there's a file 'foo' in the '/' directory,
> which is inode #365 -- where do you look for dirents which might affect
> the nlink of that inode #365?

One more idea that came to my mind today: your approach with nhashes
seems good, but it has one drawback - collisions. They may be rare, but
we'll anyway deal with them - more code, more testing.

How about to introduce "direntry IDs"? Just 32 bit values, constantly
incremented, per-inode. They'd act instead of names. We'd not need names
at all, as those IDs would play their role. Advantages of fixed-width
IDs vs variable length names are obvious.

I suspect it'd simplify and improve many things, but I'm not sure we
could make it compatible.

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-20  8:38                     ` Artem B. Bityutskiy
@ 2006-05-20  9:08                       ` Artem B. Bityutskiy
  0 siblings, 0 replies; 28+ messages in thread
From: Artem B. Bityutskiy @ 2006-05-20  9:08 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

On Sat, 2006-05-20 at 12:38 +0400, Artem B. Bityutskiy wrote:
> How about to introduce "direntry IDs"? Just 32 bit values, constantly
> incremented, per-inode. They'd act instead of names. We'd not need names
> at all, as those IDs would play their role. Advantages of fixed-width
> IDs vs variable length names are obvious.
> 
Oh, this won't work :-(

-- 
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Duplication of dirent names in JFFS2 summary
  2006-05-19 15:07           ` Artem B. Bityutskiy
  2006-05-19 15:26             ` Artem B. Bityutskiy
@ 2006-05-19 15:37             ` David Woodhouse
  1 sibling, 0 replies; 28+ messages in thread
From: David Woodhouse @ 2006-05-19 15:37 UTC (permalink / raw)
  To: dedekind; +Cc: linux-mtd

On Fri, 2006-05-19 at 19:07 +0400, Artem B. Bityutskiy wrote:
> > First, there's the case where we've garbage-collected a dirent node but
> > haven't _yet_ deleted the original. But deleting the original is the
> > whole _point_ in GC, so that's not a common case.
> Sorry, -ENOPARSE on "But deleting the original is the whole _point_ in
> GC, so that's not a common case." 

Q: Why do we write new nodes out during GC?
A: So that we can delete the original.

Q: What is the point in GC?
A: The point in GC is that we can delete the original nodes. 

Better now?

My point is that when we've written out a colliding node during garbage
collection, that is because we intend to delete the original node real
soon now -- that deletion is what the GC is _for_. That's the whole
_point_ in the GC.

So we needn't worry too much about these collisions -- they'll be
short-lived and hence uncommon.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2006-05-20  9:09 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-05-19  0:44 Duplication of dirent names in JFFS2 summary David Woodhouse
2006-05-19  1:01 ` David Woodhouse
2006-05-19 11:53   ` David Woodhouse
2006-05-19 11:58     ` Artem B. Bityutskiy
2006-05-19 12:11       ` David Woodhouse
2006-05-19 12:14         ` Artem B. Bityutskiy
2006-05-19 12:31           ` David Woodhouse
2006-05-19 14:34             ` Artem B. Bityutskiy
2006-05-19 14:59               ` David Woodhouse
2006-05-19 16:41                 ` David Woodhouse
2006-05-19 16:43                   ` David Woodhouse
2006-05-19  6:05 ` Jörn Engel
2006-05-19 10:08   ` David Woodhouse
2006-05-19 11:32 ` Artem B. Bityutskiy
2006-05-19 11:57 ` Artem B. Bityutskiy
2006-05-19 12:05   ` Artem B. Bityutskiy
2006-05-19 12:23     ` David Woodhouse
2006-05-19 14:17       ` Artem B. Bityutskiy
2006-05-19 14:50         ` David Woodhouse
2006-05-19 15:07           ` Artem B. Bityutskiy
2006-05-19 15:26             ` Artem B. Bityutskiy
2006-05-19 15:33               ` David Woodhouse
2006-05-19 15:38                 ` Artem B. Bityutskiy
2006-05-19 15:43                   ` David Woodhouse
2006-05-19 15:46                     ` Artem B. Bityutskiy
2006-05-20  8:38                     ` Artem B. Bityutskiy
2006-05-20  9:08                       ` Artem B. Bityutskiy
2006-05-19 15:37             ` David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox