All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mingming Cao <mingming.cao@oracle.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-ext4@vger.kernel.org
Subject: Re: [RFC 1/2] btree header
Date: Tue, 23 Jun 2015 22:14:19 -0600	[thread overview]
Message-ID: <558A2E9B.6080808@oracle.com> (raw)
In-Reply-To: <20150623193333.GA10037@birch.djwong.org>

On 06/23/2015 01:33 PM, Darrick J. Wong wrote:
> On Mon, Jun 22, 2015 at 08:24:37PM -0700, mingming cao wrote:
>> ---
>>  ext4_btree.h | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 146 insertions(+)
>>  create mode 100644 ext4_btree.h
>>
>> diff --git a/ext4_btree.h b/ext4_btree.h
>> new file mode 100644
>> index 0000000..efd5ce3
>> --- /dev/null
>> +++ b/ext4_btree.h
>> @@ -0,0 +1,146 @@
>> +/*
>> + * copyright (C) 2015 Oracle.  All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public
>> + * License v2 as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public
>> + * License along with this program; if not, write to the
>> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
>> + * Boston, MA 021110-1307, USA.
>> + */
>> +
>> +#ifndef EXT4_BTREE_H
>> +#define EXT4_BTREE_H
>> +
>> +/*
>> + * This btree geometry structure is used to defines how the btree block
>> + * layout for different type of btrees. This is used to implement a generic
>> + * btree implementation. The record length, value lenth, etc, are variable
>> + * based on different btree type, like reflink btree, and other btree use cases.
>> + */
>> +struct ext4_btree_geo {
>> +	__le16 node_len;
>> +	__le16 header_len;
>> +	__le16 key_len;
>> +	__le16 val_len;
>> +	__le16 rec_len;
> (I wonder if these four _len values could be u8, but whatever...)

okay...

>
>> +	__le16 max_pairs;
>> +	__le16 max_records;
>> +};
> Does this geometry structure live on disk?  The __le prefix suggests that it
> does, but the only definition of one of these is the in-core ref_btree_geo,
> which means that u16 is sufficient.

At the beginning since the prototype is a in memory btree, the geometry structure was not on disk, and declared globally. So I had those values to be u8 before... but I am planning to store those layout info on disk... that's why here I made change to __le, but hasnt really update the rest of code to reflect the geo is stored on disk yet. One question I have is whethere to store the geometry structure into the btree block header. Maybe only for index node...
>
> Given that the ext4_btree_geo seems to contain all the incore info about the
> btree you're working with, it ought to know about the magic number so that it
> can compare with whatever it reads off disk.
Good point, it should know the magic number to compare with the header, if we decide to keep the geometry incore only.... I started to wonder if we really need to store geometry info on disk...

> (Thinking ahead to when we have multiple btrees with different record
> definitions and magic numbers.)
Agree...

>> +
>> +/*
>> + * Every btree block starts from a header structure, including the index node
>> + * and the leaf node.
>> + * The index btree node started from this header structure, followed by 
>> + * (key,value) pairs
>> + * Index node:
>> + * ---------------------------------------------------------------------------
>> + * |header| key | key | key | key|...........| value | value | value | value |
>> + * |--------------------------------------------------------------------------
>> + *
>> + * And the leaf node begins with ext4_btree_header, and followed by records.
>> + * leaf node
>> + * * ------------------------------------------------------------------------
>> + * |header| rec | rec | rec | rec |...........| rec | rec | rec | rec | rec |
>> + * |-------------------------------------------------------------------------
>> + */
>> +
>> +#define REF_BTREE_MAGIC  0x52454642 /* "REFB" magic number for refcount btree */
>> +
>> +struct ext4_btree_header {
>> +	__le32	btree_magic;	/* type of btree */
>> +	__le32	btree_generation; /* generation for this type of btree */
>> +	__le32	btree_level;	/* the level of this node in the tree */
>> +	__le32	btree_numkeys;	/* # of records for leaf node*/
>> +	__le32	btree_padding1;
> I think this padding field results in a hidden u32 gap here?

Sorry that was a mistake. I added generation but forget to remove this padding.
>> +	__le64	btree_blockno;	/* remember the blk# of this node */
> (Unfortunate that you have to burn all 64 bits for all btrees, but I remembered
> that ext4 lets you set flexbg size to 2^31.  Oh well.)

true... though this block no is
>> +	__le64	btree_leftsib;	/* used for fast search sibling nodes */
>> +	__le64	btree_rightsib; 
>> +	__le64	btree_crc;	/* this node checksums */
> Future proofing, I guess?  Since we don't use 64-bit crcs right now.

yes... that was the intention... I guess that is a little overkill for now.
> If you decide that we only need 32 bits for a crc, you could make btree_level a
> u16 (since max levels is 8), numkeys a u16, and store the crc32 after that.
> Then this on-disk header would only be 40 bytes, instead of 64 bytes.

Sounds good to me .. make it more packed...
> (Hmm.  Maybe you were trying to make things align to a 64B cacheline?)
>
>> +	__le64	btree_padding2;
>> +};
>> +
>> +# define EXT4_BLOCK_SIZE	4096
> Shouldn't this be the FS blocksize?  It'll waste a lot of space on a 64k block
> filesystem (reduced fanout opportunities as well as 60K of empty space in each
> tree node) and I'm not sure what happens with 1k blocks.

Again, sorry this is me being lazy in the prototype here. I will fix it properly with filesystem blocksize ...

>> +#define EXT4_BTREE_HEADER_SIZE	sizeof(struct ext4_btree_header)
>> +#define EXT4_BTREE_NODESIZE	EXT4_BLOCK_SIZE	
>> +
>> +struct ext4_btree_key {
>> +	__le64		blockno;
>> +};
>> +#define EXT4_BTREE_KEY_SIZE	sizeof(struct ext4_btree_key)
>> +
>> +struct ext4_btree_val {
>> +	__le64		blockno;
>> +};
>> +#define EXT4_BTREE_VAL_SIZE	sizeof(struct ext4_btree_val)
>> +
>> +struct ext4_btree_rec {
>> +	struct ext4_btree_key	key;
>> +	__le32			len;
>> +	__le32			ref_count;
>> +};
> Hm.  Looking forward to having btrees to track things other than refcount, I
> imagine it'll become necessary to parameterize the record definition for each
> type of btree?
>
totally agree. Will need to make it more generic by adding interface to define different type of btree, with own record_data structure and let the btree user to pass those info into the generic code..

> (Though, I guess this record format works well enough for the inode and block
> bitmaps, and maybe that's as far as you want to go...)

for reflink, block allocation, yes. But I like to make the code more generic if possible
>> +#define EXT4_BTREE_REC_SIZE	sizeof(struct ext4_btree_rec)
>> +
>> +#define EXT4_BTREE_MAX_KEY_VAL_PAIRS \
>> +	((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \
>> +	 (EXT4_BTREE_KEY_SIZE + EXT4_BTREE_VAL_SIZE))
>> +
>> +#define EXT4_BTREE_MIN_KEY_VAL_PAIRS \
>> +        (EXT4_BTREE_MAX_KEY_VAL_PAIRS / 2)
>> +
>> +#define EXT4_BTREE_MAX_RECS \
>> +	((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \
>> +	  EXT4_BTREE_REC_SIZE)
>> +
>> +#define EXT4_BTREE_MIN_RECS \
>> +	(EXT4_BTREE_MAX_RECS / 2)
>> +
>> +#define EXT4_BTREE_MAX_LEVELS		8
>> +
>> +
>> +/* Index node */
>> +struct ext4_btree_index_node {
>> +	struct ext4_btree_header	node_header;
>> +	struct ext4_btree_key		key[EXT4_BTREE_MAX_KEY_VAL_PAIRS];
>> +	struct ext4_btree_val		val[EXT4_BTREE_MAX_KEY_VAL_PAIRS];
> (Changing EXT4_BTREE_NODESIZE to depend on the FS blocksize makes it impossible
> to use arrays here, oh well...)

Hmm.. okay, probably a pointer here.
>> +};
>> +
>> +/* Record Node */
>> +struct ext4_btree_leaf_node {
>> +	struct ext4_btree_header	node_header;
>> +	struct ext4_btree_rec		rec[EXT4_BTREE_MAX_RECS];
>> +};
>> +
>> +struct ext4_btree_node {
>> +	struct ext4_btree_header	node_header;
>> +};
>> +
>> +struct ext4_btree_root {
>> +	struct ext4_btree_node *node;
>> +	struct ext4_btree_geo	geo;
>> +};
>> +
>> +#define ext4_btree_trace printf
> printk?
>
> --D

:) this was done in the user-space first. When the code finally all move to kernel will be something like printk

Thanks a lot , for your review and comments!:)

Mingming

  reply	other threads:[~2015-06-24 10:07 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-23  3:24 [RFC 0/2] ext4 btree mingming cao
2015-06-23  3:24 ` [RFC 1/2] btree header mingming cao
2015-06-23 19:33   ` Darrick J. Wong
2015-06-24  4:14     ` Mingming Cao [this message]
2015-06-24  5:21       ` Darrick J. Wong
2015-06-23  3:24 ` [RFC 2/2] ext4 btree basic implementation mingming cao
2015-06-23 23:02   ` Darrick J. Wong
2015-06-29 22:08     ` mingming cao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=558A2E9B.6080808@oracle.com \
    --to=mingming.cao@oracle.com \
    --cc=darrick.wong@oracle.com \
    --cc=linux-ext4@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.