Crazy idea of cleanup the inode_record btrfsck things with SQL?

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* Crazy idea of cleanup the inode_record btrfsck things with SQL?
@ 2014-12-01  1:58 Qu Wenruo
  2014-12-01  3:08 ` Duncan
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Qu Wenruo @ 2014-12-01  1:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

[BACKGROUND]
I'm trying to implement the function to repair missing inode item.
Under that case, inode type must be salvaged(although it can be fallback to
FILE).

One case should be, if there is any dir_item/index or inode_ref refers the
inode as parent, the type of that inode must be DIR.

However, currently btrfsck implement (inode_record only records backref), we
are unable to search the inode_backref whose parent is given inode number.

[FIRST IMPLEMENT DESIGN]
My first thought is to implement an generic inode-relation structure,
recording parent ino, child ino, name and namelen, and restore the structure
in a rbtree, not in the child/parent's list.

But I soon recognize that this is a perfect use case for relational 
database,
as 'ino' as the primary key for INODE table,
('parent_ino', 'child_ino', 'name') as the primary key for INODE_REF table.

[CRAZY IDEA]
So why not using SQL to implement the btrfsck inode-record things?

With such crazy idea, it will be much much easier to do any iteration from a
given ino, and with the already mature RDB implement, like sqlite3, we can
save hundreds of lines of codes implementing the rb-tree or list.

[PROS]
1. Easy to maintain
    Now we don't need to maintain the rbtree searching or list 
iteration, but
    easy SQL lines and its wrapper.

2. Easy to extend
    If we need to record something more, like extents and its relation to
    inode, we only need to create 2 tables and several SQL and wrappers.

3. Reduced memory usage for HUGE fs.
    When metadata grows to several TB or even more, current rb-tree based
    implement may run short of memory since they are all stored in memory.
    But if use SQL, RDBMS like sqlite3 can restore things in either 
memory or
    disk, which may hugely reduce the memory usage for huge btrfs.

    If not use existing RDBMS, we need to implement complicated memory 
control
    system to manage memory in userland.

[CONS]
1. Heavy implement
    SQL hide the rb-tree or B+ tree implement but costs more memory(if not
    compressed) and CPU cycles, which will be slower than the simple rb-tree
    implement even using lightweight RDBMS like sqlite3.

2. Heavy dependency
    If use it, btrfs-progs will include RDBMS as the make and runtime
    dependency.
    Such low level progs depend on high level programs like sqlite3 may 
be very
    strange.

3. A lot of rework on existing codes.
    Even SQL is easier to maintain and extend, if we use it, we still 
need to
    reimplement several hundreds or even thousands lines of code to 
implement
    it, not to mention the regression tests.

4. Copyright
    Will it cause any copyright problem if using non-GPL RDBMS like 
sqlite3 in
    GPLv2 btrfs-progs?

[NEED FEEDBACK]
Any feedback or discussion on the crazy idea is welcomed, since this may 
needs
a lot of work, it definitely needs a lot review on the idea before it 
comes to
codes.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01  1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
@ 2014-12-01  3:08 ` Duncan
  2014-12-01  3:24   ` Qu Wenruo
  2014-12-01  4:03 ` Robert White
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Duncan @ 2014-12-01  3:08 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo posted on Mon, 01 Dec 2014 09:58:27 +0800 as excerpted:

> [CRAZY IDEA]
> So why not using SQL to implement the btrfsck inode-record things?

> 2. Heavy dependency
>     If use it, btrfs-progs will include RDBMS as the make and runtime
>     dependency.  Such low level progs depend on high level programs
>     like sqlite3 may be very strange.

I expect this will turn many of the "traditionalists" off, at least.  I 
could see a lot of traditional sysadmins lumping btrfs in with systemd if 
it started requiring a db, much as one of the big objections to systemd 
is the dbus requirement... even for headless servers that have never 
required it before.  Of course they could be ignored, but do we really 
want to go there?

(Personally, my gut reaction is "eew", and of course getting database 
file handling correct after an ungraceful shutdown/reboot is one of the 
big challenges for a filesystem as it is, so I'm not entirely sure 
storing information in a database file in ordered to use it to help fix 
the filesystem is a good idea since it could well be that you end up 
needing an fsck to restore the file... to do the fsck, but I could be 
convinced.  I'm worried about the ones that can't be.)

> 4. Copyright
>     Will it cause any copyright problem if using non-GPL RDBMS like
>     sqlite3 in GPLv2 btrfs-progs?

I just checked and at least on gentoo, sqlite's license is registered as 
"public domain", which is legally mergeable with code under any other 
license free or proprietary, so there should be no problem with it.  If 
something else is used of course it would depend on its license.

I believe the general kernel-rules practice for such compatible license 
merging is to keep code under compatible licenses in separate files and 
keep the individual files under their individual licenses.  While if it's 
compatible I don't believe that's generally an actual legal requirement, 
the BSD folks in particular tend to be /very/ sensitive about code 
formerly under the BSD license, for instance, merged directly into GPL 
headlined files, because in that case they can't reverse the process.  
Personally I don't see the big deal, since they seem to have /no/ problem 
with their code being taken proprietary where they can't even /look/ at 
it, but a /huge/ problem with it being taken GPL, where they may not be 
able to directly copy back to BSD, but they obviously have the code 
available to look at anyway and still mergeable provided they dual 
license, and that makes absolutely no sense to me. <shrug>

So while I don't think it'll go over well, there should be no license 
issues at least with sqlite.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01  3:08 ` Duncan
@ 2014-12-01  3:24   ` Qu Wenruo
  2014-12-01  5:47     ` Duncan
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-01  3:24 UTC (permalink / raw)
  To: Duncan, linux-btrfs


-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Duncan <1i5t5.duncan@cox.net>
To: <linux-btrfs@vger.kernel.org>
Date: 2014年12月01日 11:08
> Qu Wenruo posted on Mon, 01 Dec 2014 09:58:27 +0800 as excerpted:
>
>> [CRAZY IDEA]
>> So why not using SQL to implement the btrfsck inode-record things?
>> 2. Heavy dependency
>>      If use it, btrfs-progs will include RDBMS as the make and runtime
>>      dependency.  Such low level progs depend on high level programs
>>      like sqlite3 may be very strange.
> I expect this will turn many of the "traditionalists" off, at least.  I
> could see a lot of traditional sysadmins lumping btrfs in with systemd if
> it started requiring a db, much as one of the big objections to systemd
> is the dbus requirement... even for headless servers that have never
> required it before.  Of course they could be ignored, but do we really
> want to go there?
Oh,so terrible the systemd warfare. :(
This objection sounds very solid now.

Anyway, this is a crazy idea...
(Maybe it is only me so lazy to implement the rb-tree based things)
>
> (Personally, my gut reaction is "eew", and of course getting database
> file handling correct after an ungraceful shutdown/reboot is one of the
> big challenges for a filesystem as it is, so I'm not entirely sure
> storing information in a database file in ordered to use it to help fix
> the filesystem is a good idea since it could well be that you end up
> needing an fsck to restore the file... to do the fsck, but I could be
> convinced.  I'm worried about the ones that can't be.)
The db file is mostly used in memory, only when the metadata is really 
really big, maybe when the fs tree's level
is 7 or 8 we may need to use db file.
And the db file should be anonymous(unlinked but open) since it is only 
used in one btrfsck session,
not reused or really needed to be stored.
So I will not be a problem of restore or something like that.

>
>> 4. Copyright
>>      Will it cause any copyright problem if using non-GPL RDBMS like
>>      sqlite3 in GPLv2 btrfs-progs?
> I just checked and at least on gentoo, sqlite's license is registered as
> "public domain", which is legally mergeable with code under any other
> license free or proprietary, so there should be no problem with it.  If
> something else is used of course it would depend on its license.
>
> I believe the general kernel-rules practice for such compatible license
> merging is to keep code under compatible licenses in separate files and
> keep the individual files under their individual licenses.  While if it's
> compatible I don't believe that's generally an actual legal requirement,
> the BSD folks in particular tend to be /very/ sensitive about code
> formerly under the BSD license, for instance, merged directly into GPL
> headlined files, because in that case they can't reverse the process.
> Personally I don't see the big deal, since they seem to have /no/ problem
> with their code being taken proprietary where they can't even /look/ at
> it, but a /huge/ problem with it being taken GPL, where they may not be
> able to directly copy back to BSD, but they obviously have the code
> available to look at anyway and still mergeable provided they dual
> license, and that makes absolutely no sense to me. <shrug>
>
>
> So while I don't think it'll go over well, there should be no license
> issues at least with sqlite.
>
Thanks for the license check anyway.

Thanks,
Qu.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01  1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
  2014-12-01  3:08 ` Duncan
@ 2014-12-01  4:03 ` Robert White
  2014-12-01  6:18   ` Qu Wenruo
  2014-12-01 12:53 ` Austin S Hemmelgarn
  2014-12-11 19:38 ` Roger Binns
  3 siblings, 1 reply; 18+ messages in thread
From: Robert White @ 2014-12-01  4:03 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba

On 11/30/2014 05:58 PM, Qu Wenruo wrote:
> ("why not use SQL to..." suggestion)

SQL, as in Structured Query Language, is _terrible_ for recursion. It 
expresses all of its elements in terms of set theory and really can only 
implement union and intersection of flat sets.

Several companies offer extensions to SQL in their implementations to 
help with this lack of recursion such as "prior" in Oracle's PSQL, but 
they are all stateful beyond reason.

Several companies, including microsoft, have proposed and partially 
implemented "a relational database as a file system" paradigm and then 
crashed into the fact that dealing with the parent of the parent of 
something is different than dealing with the parent of the parent of the 
parent of something.

There is a humours-but-true saying: "If you have a problme, and you 
decide to solve it with (regex or xml or uml or sql etc) you now have 
two problems."

Writing the SQL to walk the tree is harder than allocating the memory as 
a vector, filling it with the data, and then walking the pointers.

Your suggestion is the first step on the road to The Inner Platform 
Effect™. You have a specialized database (parent, inode, name) and now 
you want to put a generic database engine over the specialized database 
so that you an re-implement the specialized database with generic 
primitives.

http://en.wikipedia.org/wiki/Inner-platform_effect

Things need to be only as generic as they need to be, and no more 
generic than that.

Replacing a pointer to a record with a pointer to a cursor's result 
table that will give you the name of the next result to query is not a 
win. Even as you spell it out you can see that it is _not_ a reduction 
in memory or processing.

And the "easy SQL lines" stop being that easy when "name" stops being 
unique.

(I've been down this road before. Not with file systems but with 
"managed objects" in a network management system. Nodes, Parent nodes, 
etc. Just referring to distributed things like networks switches instead 
of file system inodes. ... It doesn't end well. 8-) )

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01  3:24   ` Qu Wenruo
@ 2014-12-01  5:47     ` Duncan
  2014-12-01  6:25       ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Duncan @ 2014-12-01  5:47 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo posted on Mon, 01 Dec 2014 11:24:50 +0800 as excerpted:

> The db file is mostly used in memory, only when the metadata is really
> really big, maybe when the fs tree's level is 7 or 8 we may need to use
> db file.

So fscking the database in ordered to fsck the database isn't an issue.  
One objection down! =:^)

But seriously, the politics of the idea remains its biggest nemesis in my 
opinion.  And in systemd we've unfortunately a live demonstration of just 
how big a nemesis that can be.  =:^(  If the technical reasoning for it 
is sound and the benefit high enough, great, but IMO the benefit will 
need to be pretty high to justify the risk of political fallout, and I 
doubt it's anything close to that high.  But it's not my call, so we'll 
see.  Thinks could certainly get interesting if it's judged to be worth 
it.  < Checking popcorn stash >

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01  4:03 ` Robert White
@ 2014-12-01  6:18   ` Qu Wenruo
  2014-12-01 18:10     ` Robert White
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-01  6:18 UTC (permalink / raw)
  To: Robert White, linux-btrfs; +Cc: David Sterba


-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White <rwhite@pobox.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs 
<linux-btrfs@vger.kernel.org>
Date: 2014年12月01日 12:03
> On 11/30/2014 05:58 PM, Qu Wenruo wrote:
>> ("why not use SQL to..." suggestion)
>
> SQL, as in Structured Query Language, is _terrible_ for recursion. It 
> expresses all of its elements in terms of set theory and really can 
> only implement union and intersection of flat sets.
>
> Several companies offer extensions to SQL in their implementations to 
> help with this lack of recursion such as "prior" in Oracle's PSQL, but 
> they are all stateful beyond reason.
>
> Several companies, including microsoft, have proposed and partially 
> implemented "a relational database as a file system" paradigm and then 
> crashed into the fact that dealing with the parent of the parent of 
> something is different than dealing with the parent of the parent of 
> the parent of something.
>
> There is a humours-but-true saying: "If you have a problme, and you 
> decide to solve it with (regex or xml or uml or sql etc) you now have 
> two problems."
Wait, regex and uml and xml is OK, but never heard sql is one of them...
>
> Writing the SQL to walk the tree is harder than allocating the memory 
> as a vector, filling it with the data, and then walking the pointers.
In fact, such INODE and INODE_REF table is not (completely nor mainly) 
used to walk the tree,
it is mainly used to search for:
1. is there any inode_ref refers to a given ino as parent.

This will not even be a problem when the fs is *OK*, since do a simple 
btrfs_search_slot()
with key( objectied = ino, type = BTRFS_DIR_INDEX/ITEM_KEY, offset = 0) 
will do it.

However when it comes to corrupted leaf, the whole INODE_ITEM with its 
DIR_INDEX/ITEM are gone
with the leaf, so the old search way is not usable and btrfs-progs will 
relay on other mechanism
to determine that.
And unfortunately, there is no such mechanism.


2. is there any dir_index/dir_item refers to a given ino as child.
Current inode_record works fine for this object.

So when the crazy idea disappear and sane ideas come back, it will 
probably be rb-tree based
(parent, ino, name, namelen) entries to record parent-child relation
(currently it is a list_head only records backref inside the inode_record).

And another rb-tree based (ino) entries (same as current inode_record 
structure).
>
> Your suggestion is the first step on the road to The Inner Platform 
> Effect™. You have a specialized database (parent, inode, name) and now 
> you want to put a generic database engine over the specialized 
> database so that you an re-implement the specialized database with 
> generic primitives.
>
> http://en.wikipedia.org/wiki/Inner-platform_effect
>
> Things need to be only as generic as they need to be, and no more 
> generic than that.
>
> Replacing a pointer to a record with a pointer to a cursor's result 
> table that will give you the name of the next result to query is not a 
> win. Even as you spell it out you can see that it is _not_ a reduction 
> in memory or processing.
>
> And the "easy SQL lines" stop being that easy when "name" stops being 
> unique.
Name is still unique when parent ino is given, so the INODE_REF tables' 
primary key is not
name but the (parent, ino, name) combine.

But the inner platform effect still seems valid for my crazy idea.
Anyway, the crazy idea comes to me when I see the RDB like feature in 
the inode_record structure,
-and I just want to save sometime coding the new (parent, ino, name, 
namelen) rb-tree-.
>
> (I've been down this road before. Not with file systems but with 
> "managed objects" in a network management system. Nodes, Parent nodes, 
> etc. Just referring to distributed things like networks switches 
> instead of file system inodes. ... It doesn't end well. 8-) )
>
The RDB idea must come to you just like me, wanting to write less codes, 
right?
So it seems the end may be the same. :-(

Thanks,
Qu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01  5:47     ` Duncan
@ 2014-12-01  6:25       ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2014-12-01  6:25 UTC (permalink / raw)
  To: Duncan, linux-btrfs


-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Duncan <1i5t5.duncan@cox.net>
To: <linux-btrfs@vger.kernel.org>
Date: 2014年12月01日 13:47
> Qu Wenruo posted on Mon, 01 Dec 2014 11:24:50 +0800 as excerpted:
>
>> The db file is mostly used in memory, only when the metadata is really
>> really big, maybe when the fs tree's level is 7 or 8 we may need to use
>> db file.
> So fscking the database in ordered to fsck the database isn't an issue.
> One objection down! =:^)
>
> But seriously, the politics of the idea remains its biggest nemesis in my
> opinion.  And in systemd we've unfortunately a live demonstration of just
> how big a nemesis that can be.  =:^(  If the technical reasoning for it
> is sound and the benefit high enough, great, but IMO the benefit will
> need to be pretty high to justify the risk of political fallout, and I
> doubt it's anything close to that high.  But it's not my call, so we'll
> see.  Thinks could certainly get interesting if it's judged to be worth
> it.  < Checking popcorn stash >
>
No (systemd civilian) war, make love!

I seldom consider the politics problem of the insane idea even the 
systemd warfare is still here.
(Emmm, maybe it is because Arch accept systemd too long time ago without 
too much problem?)

Anyway, it is just an insane idea and any feedback killing the seed of 
insanity as soon as possible is welcomed.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01  1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
  2014-12-01  3:08 ` Duncan
  2014-12-01  4:03 ` Robert White
@ 2014-12-01 12:53 ` Austin S Hemmelgarn
  2014-12-02  0:37   ` Qu Wenruo
  2014-12-11 19:38 ` Roger Binns
  3 siblings, 1 reply; 18+ messages in thread
From: Austin S Hemmelgarn @ 2014-12-01 12:53 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3533 bytes --]

On 2014-11-30 20:58, Qu Wenruo wrote:
> [BACKGROUND]
> I'm trying to implement the function to repair missing inode item.
> Under that case, inode type must be salvaged(although it can be fallback to
> FILE).
>
> One case should be, if there is any dir_item/index or inode_ref refers the
> inode as parent, the type of that inode must be DIR.
>
> However, currently btrfsck implement (inode_record only records
> backref), we
> are unable to search the inode_backref whose parent is given inode number.
>
> [FIRST IMPLEMENT DESIGN]
> My first thought is to implement an generic inode-relation structure,
> recording parent ino, child ino, name and namelen, and restore the
> structure
> in a rbtree, not in the child/parent's list.
>
> But I soon recognize that this is a perfect use case for relational
> database,
> as 'ino' as the primary key for INODE table,
> ('parent_ino', 'child_ino', 'name') as the primary key for INODE_REF table.
>
> [CRAZY IDEA]
> So why not using SQL to implement the btrfsck inode-record things?
>
> With such crazy idea, it will be much much easier to do any iteration
> from a
> given ino, and with the already mature RDB implement, like sqlite3, we can
> save hundreds of lines of codes implementing the rb-tree or list.
>
> [PROS]
> 1. Easy to maintain
>     Now we don't need to maintain the rbtree searching or list
> iteration, but
>     easy SQL lines and its wrapper.
>
> 2. Easy to extend
>     If we need to record something more, like extents and its relation to
>     inode, we only need to create 2 tables and several SQL and wrappers.
>
> 3. Reduced memory usage for HUGE fs.
>     When metadata grows to several TB or even more, current rb-tree based
>     implement may run short of memory since they are all stored in memory.
>     But if use SQL, RDBMS like sqlite3 can restore things in either
> memory or
>     disk, which may hugely reduce the memory usage for huge btrfs.
>
>     If not use existing RDBMS, we need to implement complicated memory
> control
>     system to manage memory in userland.
>
> [CONS]
> 1. Heavy implement
>     SQL hide the rb-tree or B+ tree implement but costs more memory(if not
>     compressed) and CPU cycles, which will be slower than the simple
> rb-tree
>     implement even using lightweight RDBMS like sqlite3.
>
> 2. Heavy dependency
>     If use it, btrfs-progs will include RDBMS as the make and runtime
>     dependency.
>     Such low level progs depend on high level programs like sqlite3 may
> be very
>     strange.
>
> 3. A lot of rework on existing codes.
>     Even SQL is easier to maintain and extend, if we use it, we still
> need to
>     reimplement several hundreds or even thousands lines of code to
> implement
>     it, not to mention the regression tests.
>
> 4. Copyright
>     Will it cause any copyright problem if using non-GPL RDBMS like
> sqlite3 in
>     GPLv2 btrfs-progs?
>
> [NEED FEEDBACK]
> Any feedback or discussion on the crazy idea is welcomed, since this may
> needs
> a lot of work, it definitely needs a lot review on the idea before it
> comes to
> codes.
>
So, I think this does a good job of highlighting one of the bigger 
issues with btrfsck when it is compared to ext* and/or xfs.  Despite 
this being a problem, I really don't think using a rdbms is the way to 
fix it, both for reasons outlined in other responses, and because fsck 
should be as fast as possible when nothing is wrong with the fs.



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01  6:18   ` Qu Wenruo
@ 2014-12-01 18:10     ` Robert White
  2014-12-02  1:17       ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Robert White @ 2014-12-01 18:10 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba

On 11/30/2014 10:18 PM, Qu Wenruo wrote:
> (advocacy for using SQL internally for btrfsck)

All of these ideas you want to toss a entire SQL front end on are more 
simply handled with simple data structures.

In C++ terms "map<inode,parent>" and/or "map<parent,vector<children>>" 
beats the heck out of including all of SQL and its related indexes and 
type conversions (sqlite, for example, stores integers as doubles, or 
decimal numbers depending on version).

RDBMS _are_ good at representing things, so noticing that a thing _can_ 
be represented with an RDBMS is very common.

But by the time you put two or three indexes on relation->(parent, 
child, name) you've given yourself three or four copies of the core data 
in three or four different places. And those copies are largely 
immutable and randomly distributed and will include the overhead in 
memory for fairly sparse trees.

It's not that it's an unworkable idea.

But it is unnecessarily generic and adds an order of magnitude of 
complexity to your problems.

For instance, if I boot from a CD to run a btrfsck where will the 
database files be written to?

If it is an in-memory table why do I want the overhead of SQL to look up 
something indexed by integer?

If the sparse vectors of integers don't fit in memory why would the SQL 
tables of integers fit "better"?

SQL would be the second slowest possible for representing this data -- 
The slowest would be an XML schema stored as flat text.

So your crazy ides is also a pretty bad one compared to most if not all 
sparse data representations and techniques that come to bear on this 
problem set. All you are really doing is pushing the same work (walking 
a tree to find an integer) into a difficult "spell it out in SQL" space.

Is prepare_sql(curosr,"SELECT parent FROM parantage_tree WHERE child = 
%d"); execute_sql(cursor,child); and its possible error returns actually 
clearer or better than "parent=inheretance.find(child); if 
(parent!=inheretance.end()) {...}" (as it might be written in C++)?

Do you want to know if (keep track of whether) an inode is allocated and 
referenced? There's a sparse bit-vector for that...

Want to be able to get back to an inode's location on disk, a sparse 
array of disk offsets exists (among other options).

Before you can even access the RDBMS you'd have to fill it completely; 
otherwise you wouldn't know if a select returning zero rows was an 
authoritative indication that the datum didn't exist or if it was 
instead an indication that the datum hadn't been populated yet.

THIS IS NOT SARCASM: If you strongly disagree, I suggest you start 
coding. Seriously, don't ask, do... And in a month really check to see 
if your solution is any smaller, faster, easier, or in _any_ _way_ more 
optimal than using native data structures. The attempt will answer the 
question definitively and then we'll all know...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01 12:53 ` Austin S Hemmelgarn
@ 2014-12-02  0:37   ` Qu Wenruo
  2014-12-11 19:00     ` Martin Steigerwald
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-02  0:37 UTC (permalink / raw)
  To: Austin S Hemmelgarn, linux-btrfs


-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs 
<linux-btrfs@vger.kernel.org>
Date: 2014年12月01日 20:53
> On 2014-11-30 20:58, Qu Wenruo wrote:
>> [snipped]
>>
> So, I think this does a good job of highlighting one of the bigger 
> issues with btrfsck when it is compared to ext* and/or xfs. Despite 
> this being a problem, I really don't think using a rdbms is the way to 
> fix it, both for reasons outlined in other responses, and because fsck 
> should be as fast as possible when nothing is wrong with the fs.
>
>
Although I am not stick to the crazy idea, I think it is still needed to 
point out that,
even btrfsck is ran on a clean btrfs, it still needs to iterate all the 
extents, metadata.
So it may not be as fast as you thought even with the current implement.

Anyway thanks for the feedback.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01 18:10     ` Robert White
@ 2014-12-02  1:17       ` Qu Wenruo
  2014-12-03 19:18         ` Robert White
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-02  1:17 UTC (permalink / raw)
  To: Robert White, linux-btrfs; +Cc: David Sterba


-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White <rwhite@pobox.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs 
<linux-btrfs@vger.kernel.org>
Date: 2014年12月02日 02:10
> On 11/30/2014 10:18 PM, Qu Wenruo wrote:
>> (advocacy for using SQL internally for btrfsck)
>
> All of these ideas you want to toss a entire SQL front end on are more 
> simply handled with simple data structures.
>
> In C++ terms "map<inode,parent>" and/or "map<parent,vector<children>>" 
> beats the heck out of including all of SQL and its related indexes and 
> type conversions (sqlite, for example, stores integers as doubles, or 
> decimal numbers depending on version).
>
> RDBMS _are_ good at representing things, so noticing that a thing 
> _can_ be represented with an RDBMS is very common.
>
> But by the time you put two or three indexes on relation->(parent, 
> child, name) you've given yourself three or four copies of the core 
> data in three or four different places. And those copies are largely 
> immutable and randomly distributed and will include the overhead in 
> memory for fairly sparse trees.
>
> It's not that it's an unworkable idea.
>
> But it is unnecessarily generic and adds an order of magnitude of 
> complexity to your problems.
>
> For instance, if I boot from a CD to run a btrfsck where will the 
> database files be written to?
This is easy, memory.
Since only when we judge the fs' metadata is too huge then we will use file.

One of the problem in current inode_record is, btrfsck can only record 
them all in memory,
when metadata of the file system is too big, sysadmin can only add swap 
space or memory
to handle it.

Although it is not a urgent problem, since 1T btrfs fs with about 5G 
metadata will only takes about 500M
checking chunk and extent and even less for checking fs roots.
>
> If it is an in-memory table why do I want the overhead of SQL to look 
> up something indexed by integer?
>
> If the sparse vectors of integers don't fit in memory why would the 
> SQL tables of integers fit "better"?
>
> SQL would be the second slowest possible for representing this data -- 
> The slowest would be an XML schema stored as flat text.
>
> So your crazy ides is also a pretty bad one compared to most if not 
> all sparse data representations and techniques that come to bear on 
> this problem set. All you are really doing is pushing the same work 
> (walking a tree to find an integer) into a difficult "spell it out in 
> SQL" space.
>
> Is prepare_sql(curosr,"SELECT parent FROM parantage_tree WHERE child = 
> %d"); execute_sql(cursor,child); and its possible error returns 
> actually clearer or better than "parent=inheretance.find(child); if 
> (parent!=inheretance.end()) {...}" (as it might be written in C++)?
>
> Do you want to know if (keep track of whether) an inode is allocated 
> and referenced? There's a sparse bit-vector for that...
>
> Want to be able to get back to an inode's location on disk, a sparse 
> array of disk offsets exists (among other options).
>
> Before you can even access the RDBMS you'd have to fill it completely; 
> otherwise you wouldn't know if a select returning zero rows was an 
> authoritative indication that the datum didn't exist or if it was 
> instead an indication that the datum hadn't been populated yet.
>
> THIS IS NOT SARCASM: If you strongly disagree, I suggest you start 
> coding. Seriously, don't ask, do... And in a month really check to see 
> if your solution is any smaller, faster, easier, or in _any_ _way_ 
> more optimal than using native data structures. The attempt will 
> answer the question definitively and then we'll all know...
I know this is a crazy idea and not disagree with your opinion.
But I am also somewhat tired of bringing new structure new searching 
functions or even bring larger change on
the btrfsck record infrastructure when I found that can't provide the 
function when new recovery function is going
to be implemented.

In fact, after I implement the whole corrupted-leaf recovery patchset, I 
may try to implement it as an experimental
try-and-error for cleanup/enhance for the inode_record infrastructure 
and see if there is the huge performance drop
or the lines of code reduced(anyway, just a personal try-and-error, will 
not send them if there is no such interesting
result, and it may be highly possible a disaster as you mentioned)

Thanks,
Qu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-02  1:17       ` Qu Wenruo
@ 2014-12-03 19:18         ` Robert White
  2014-12-04  6:56           ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Robert White @ 2014-12-03 19:18 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: David Sterba

[-- Attachment #1: Type: text/plain, Size: 6077 bytes --]

On 12/01/2014 05:17 PM, Qu Wenruo wrote:
> But I am also somewhat tired of bringing new structure new searching
> functions or even bring larger change on
> the btrfsck record infrastructure when I found that can't provide the
> function when new recovery function is going
> to be implemented.

DISCLAIMER: My actual thing is modeling, so coming from the idea that 
btrfsck is basically building a model of the filesystem and comparing 
the theoretically correct model to the actual filesystem as found... and 
having spent nearly zero time inside of the btrfsck code base itself...

I've looked at what I _think_ you are trying to do with the 
lookup/correlate phase finding peer/parent/child references and I'd 
think other things would make sense far more sense than SQL.

C++ Standard Template Library at a minimum, life is too short to 
reinvent data lookups in memory. But its also too short to re-factor SQL 
queries or wait for degenerate full-table scans.

The boost "multi-index container", if you are going to bring in an 
external library, would do you far better for dealing with future 
structural changes. It would directly index the fields of your 
(parent,child,name) relation and let you switch iteration on-the-fly 
(e.g. you can be walking parents and use the same iterator to start 
walking names etc). (there's even a cookbook item for adding an LRU 
cache as a natural index for space-constrained items.) Again you'd have 
to switch over to C++, but it implements exactly what I think you are 
looking for. the multi-index container declaration syntax gets a little 
thick but once declared the usage model is really simple. (you pay 
most-or-all the cost during "update" which is really a remove-then-readd 
but you won't be updating as often as adding so...)

My "crazy-ist idea" would be to maybe mmap all the metadata from the 
filesystem into the process space and build a page-table-like index over 
that. The virtual memory demand paging will become your caching buddy. 
mapped_block+offest gets you directly to the whole native structure. You 
might have to juggle mappings (map and unmap them) on systems with arger 
filesystems but smaller virtual tables (e.g. classic pentium 32 bit). 
Even then, you'd be juggling mappings and the kernel would be doing all 
the real caching. The more data you can leave in the mmapped ranges the 
less storage management takes place in the local code.

IMHO, finding a USB stick to make a swap space on is probably just as 
good, if not better than, doing it with a whole other filesystem and the 
"put my temp files there" option during emergency maintenance.


If the mailing list engine honors attachments, you will find 
FileEntries.h from one of my projects, offered as a simplified example 
of a directory entry cache implemented in a boost multi-index container.

(Note that this is fully working code from an upcoming addition to my 
soruceforge project, so it's not just theoretical implementation ideas.)

struct FileEntry {
   struct NameIndex {};
   struct DirectoryIndex {};
   struct INodeIndex {};
   struct TypedIndex {};
   dev_t         Device;
   long          ParentINode;
   long          INode;
   std::string   Name;
   enum FileType Type;
};

The empty structs inside the struct act as index identifiers/names 
(Because C++ templates only disambiguate on types so 
FileEntry::NameIndex is the type-name of the index of file names (etc).

Any other data could be added to FileEntry without disrupting any 
existing code for indexing or iterating.

The container itself is a little grusome textually ::

typedef multi_index_container<
   FileEntry,
   indexed_by<
     ordered_non_unique<
       tag<FileEntry::NameIndex>,
       member<
         FileEntry,
         std::string,
         &FileEntry::Name
       >
     >,
     ordered_non_unique<
       tag<FileEntry::DirectoryIndex>,
       composite_key<
         FileEntry,
         member<
           FileEntry,
           dev_t,
           &FileEntry::Device
         >,
         member<
           FileEntry,
           long,
           &FileEntry::ParentINode
         >
       >
     >,
     ordered_non_unique<
       tag<FileEntry::INodeIndex>,
       composite_key<
         FileEntry,
         member<
           FileEntry,
           dev_t,
           &FileEntry::Device
         >,
         member<
           FileEntry,
           long,
           &FileEntry::INode
         >
       >
     >,
     ordered_non_unique<
       tag<FileEntry::TypedIndex>,
       composite_key<
         FileEntry,
         member<
           FileEntry,
           enum FileType,
           &FileEntry::Type
         >,
         member<
           FileEntry,
           std::string,
           &FileEntry::Name
         >
       >
     >
   >
 > FileSet;


But after that it's super easy to use, uh, if you know STL iterator 
speak anyway. 8-)

Having a file set is easy and typesafe in any type-able context (such as 
a member of another structure, or a global):
FileSet result;

Picking yoru index/view has to be done by (admittedly ugly looking) type
FileSet::index<FileEntry::NameIndex>::type

But then when you make one and plumb it ("result" being the working set
FileSet::index<FileEntry::NameIndex>::type & by_name = 
result.get<FileEntry::NameIndex>();

by_name is now a fully functional view of the set as an iteraterable 
container with elements like by_name.erase(some_name) and such.

It gets you what you say you want from SQL but without all the 
indirection or duplication of SQL.

(Note that the attached example will fill a FileSet object into memory 
with the entire (sub) tree of a file system if you call Fill() with the 
deep selector; or just fill the set with the contents of a single 
directory when the shallow selector is used.

etc.

So anyway. I've been in this modeling space for a while. I looked at 
other means of modeling such as SQL and I decided this one was "the 
best" for what I was doing.

The results are fast, deterministic, and type-safe with no internal 
typecasts or full table scan results.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: FileEntries.h --]
[-- Type: text/x-chdr; name="FileEntries.h", Size: 6532 bytes --]

#ifndef FILEENTRIES_HPP
#define FILEENTRIES_HPP

/* Part of Underdog. https://underdog.sourceforge.net
 * Released under GPLv3. Other licenses may be available.
 * Robert White © Copyright 2014 <rwhite@pobox.com>
 */

#include <dirent.h>     /* Defines DT_* constants */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <regex.h>


#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/tag.hpp>

#include "FileWrapper.h"

namespace FileEntries {

using namespace ::boost;
using namespace ::boost::multi_index;

enum FileType {
  UnknownFile,
  BlockFile,
  CharFile,
  DirectoryFile,
  PipeFile,
  SymlinkFile,
  SocketFile,
  RegularFile
};

inline
enum FileType
DecodeFileType(const int filetype)
{
  enum FileType retval = UnknownFile;
  switch (filetype) {
  case DT_BLK: retval = BlockFile; break;
  case DT_CHR: retval = CharFile; break;
  case DT_DIR: retval = DirectoryFile; break;
  case DT_FIFO: retval = PipeFile; break;
  case DT_LNK: retval = SymlinkFile; break;
  case DT_REG: retval = RegularFile; break;
  case DT_SOCK: retval = BlockFile; break;
  }
  return retval;
}


struct FileEntry {
  struct NameIndex {};
  struct DirectoryIndex {};
  struct INodeIndex {};
  struct TypedIndex {};
  dev_t		Device;
  long		ParentINode;
  long		INode;
  std::string	Name;
  enum FileType	Type;
};

typedef multi_index_container<
  FileEntry,
  indexed_by<
    ordered_non_unique<
      tag<FileEntry::NameIndex>,
      member<
	FileEntry,
	std::string,
	&FileEntry::Name
      >
    >,
    ordered_non_unique<
      tag<FileEntry::DirectoryIndex>,
      composite_key<
        FileEntry,
        member<
          FileEntry,
	  dev_t,
	  &FileEntry::Device
        >,
        member<
	  FileEntry,
	  long,
	  &FileEntry::ParentINode
        >
      >
    >,
    ordered_non_unique<
      tag<FileEntry::INodeIndex>,
      composite_key<
        FileEntry,
        member<
          FileEntry,
	  dev_t,
	  &FileEntry::Device
        >,
        member<
	  FileEntry,
	  long,
	  &FileEntry::INode
        >
      >
    >,
    ordered_non_unique<
      tag<FileEntry::TypedIndex>,
      composite_key<
        FileEntry,
        member<
          FileEntry,
	  enum FileType,
	  &FileEntry::Type
	>,
	member<
	  FileEntry,
	  std::string,
	  &FileEntry::Name
	>
      >
    >
  >
> FileSet;

  struct regex {
    regex_t	built;
    regex(const std::string pattern):
      built({})
    {
      if (regcomp(&built,pattern.c_str(),REG_EXTENDED|REG_NOSUB) != 0) {
	throw std::exception();
      }
    }
    ~regex()
    {
      regfree(&built);
    }
    bool operator()(const std::string & value) const
    {
      return 0 == regexec(&built,value.c_str(),0,0,0);
    }
  };



  struct selector
  {
    typedef ::std::pair<bool,bool> result_type;
  };

  struct shallow_selector:
    public selector
  {
    regex	include_pattern;
		shallow_selector(const std::string pattern 
		  = ::std::string("^(([^.])|([.][^.])|([.]{2}.))")):
		include_pattern(pattern)
		{
		  return;
		}
    result_type operator()(const FileEntry & ent) const
    {
      return result_type(include_pattern(ent.Name),false);
    }
  };

  struct deep_selector:
    public selector
  {
    regex	include_pattern;
    regex	recurse_pattern;
		deep_selector(
		  const std::string i_pattern = "^(([^.])|([.][^.])|([.]{2}.))",
		  const std::string r_pattern = "^(([^.])|([.][^.])|([.]{2}.))"
		):
		  include_pattern(i_pattern),
		  recurse_pattern(r_pattern)
		{
		  return;
		}
    result_type
    operator()(const FileEntry & ent) const
    {
      bool candidate = (ent.Type == DirectoryFile) || (ent.Type == SymlinkFile);
      return result_type(include_pattern(ent.Name),candidate && recurse_pattern(ent.Name));
    }
  };

  template <
    class U,
    bool First = true,
    bool Second = false
  >
  struct inverted:
    public selector
  {
    const U &	Real;
    inverted(const U & real):
      Real(real)
    {
      return;
    }
    result_type
    operator()(const FileEntry & ent) const
    {
      result_type retval = Real(ent);
      if (First) {
	retval.first = !retval.first;
      }
      if (Second) {
	retval.second = !retval.second;
      }
      return retval;
    }
  };

struct linux_dirent {
  long           d_inode;
  off_t          d_offset;
  unsigned short d_reclen;
  char           d_name[1];
};

union cursor_type {
  char *		ch;
  struct linux_dirent *	ld;
};

template <class U>
void
Fill(int directory_fd,
     FileSet & result,
     const U & selector
    )
{
  FileEntry temp = {};
  int	count;
  char	buffer[1024];
  char*	cursor;
  char*	extent;
  struct stat directory_stat = {};
  fstat(directory_fd,&directory_stat);
  FileSet::index<FileEntry::DirectoryIndex>::type & result_view =
    result.get<FileEntry::DirectoryIndex>();
  temp.Device = directory_stat.st_dev;
  temp.ParentINode = directory_stat.st_ino;
  result_view.erase(
    result_view.lower_bound(boost::make_tuple(temp.Device,temp.ParentINode)),
    result_view.upper_bound(boost::make_tuple(temp.Device,temp.ParentINode))
  );
  lseek(directory_fd,0,SEEK_SET);
  while ((count = syscall(SYS_getdents,directory_fd,buffer,sizeof(buffer))) > 0) {
    int next = 0;
    extent = buffer + count;
    for (cursor = buffer; cursor < extent; cursor += next) {
      linux_dirent * entry = reinterpret_cast<linux_dirent *>(cursor);
      next = entry->d_reclen;
      temp.INode = entry->d_inode;
      temp.Name = entry->d_name;
      temp.Type = DecodeFileType(static_cast<int>(*(cursor + next - 1)));
      ::std::pair<bool,bool> selection = selector(temp);
      if (selection.first) {
        result_view.insert(temp);
      }
      if (selection.second) try {
	FileWrapper subdir(temp.Name,O_DIRECTORY|O_RDONLY,directory_fd);
	Fill(subdir.fd(),result,selector);
      } catch (...) {
	  // No recursion on error, silently, it's just easier.
      }
    }
  }
  return;
}

template <>
void
Fill<const std::string &>(int directory_fd,
     FileSet & result,
     const std::string & selector
    )
{
  return Fill(directory_fd,result,shallow_selector(selector));
}

inline
void
Fill(int directory_fd,
     FileSet & result
    )
{
  return Fill(directory_fd,result,shallow_selector());
}


}
#endif

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: FileWrapper.h --]
[-- Type: text/x-chdr; name="FileWrapper.h", Size: 1501 bytes --]

#ifndef FILEWRAPPER_HPP
#define FILEWRAPPER_HPP

/* Part of Underdog. https://underdog.sourceforge.net
 * Released under GPLv3. Other licenses may be available.
 * Robert White © Copyright 2014 <rwhite@pobox.com>
 */


#include <string>
#include <exception>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>


struct FileWrapper {
  struct BadFile: public std::exception { };
  
  FileWrapper(std::string name, int flags, int base = AT_FDCWD):
    FileDescriptor(openat(base,name.c_str(),flags))
  {
    if (FileDescriptor == -1) {
      throw BadFile();
    }
  }

  explicit
  FileWrapper(const FileWrapper & X):
    FileDescriptor(dup(X.FileDescriptor))
  {
    if (FileDescriptor == -1) {
      throw BadFile();
    }
  }
  
  explicit
  FileWrapper(int other_fd):
    FileDescriptor(dup(other_fd))
  {
    if (FileDescriptor == -1) {
      throw BadFile();
    }
  }
  
  FileWrapper(int other_fd, int new_fd):
    FileDescriptor(dup2(other_fd,new_fd))
  {
    if (FileDescriptor == -1) {
      throw BadFile();
    }
  }
  
  void
  Replace(const FileWrapper & X)
  {
    int newfd = dup(X.FileDescriptor);
    if (newfd != -1) {
      close(FileDescriptor);
      FileDescriptor = newfd;
    }
  }
  
  ~FileWrapper()
  {
    close(FileDescriptor);
  }
  
  operator int() const { return FileDescriptor; }
  int fd() const { return FileDescriptor; }
  
protected:
  int FileDescriptor;

private:
  FileWrapper & operator =(const FileWrapper & X);
};

#endif

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-03 19:18         ` Robert White
@ 2014-12-04  6:56           ` Qu Wenruo
  2014-12-10 21:57             ` Zygo Blaxell
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-04  6:56 UTC (permalink / raw)
  To: Robert White, linux-btrfs; +Cc: David Sterba


-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White <rwhite@pobox.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs 
<linux-btrfs@vger.kernel.org>
Date: 2014年12月04日 03:18
> On 12/01/2014 05:17 PM, Qu Wenruo wrote:
>> But I am also somewhat tired of bringing new structure new searching
>> functions or even bring larger change on
>> the btrfsck record infrastructure when I found that can't provide the
>> function when new recovery function is going
>> to be implemented.
>
> DISCLAIMER: My actual thing is modeling, so coming from the idea that 
> btrfsck is basically building a model of the filesystem and comparing 
> the theoretically correct model to the actual filesystem as found... 
> and having spent nearly zero time inside of the btrfsck code base 
> itself...
>
> I've looked at what I _think_ you are trying to do with the 
> lookup/correlate phase finding peer/parent/child references and I'd 
> think other things would make sense far more sense than SQL.
>
> C++ Standard Template Library at a minimum, life is too short to 
> reinvent data lookups in memory. But its also too short to re-factor 
> SQL queries or wait for degenerate full-table scans.
C++ doesn't comes to me in my first thought, since I haven't use it for 
a long long time... :-(
>
> The boost "multi-index container", if you are going to bring in an 
> external library, would do you far better for dealing with future 
> structural changes. It would directly index the fields of your 
> (parent,child,name) relation and let you switch iteration on-the-fly 
> (e.g. you can be walking parents and use the same iterator to start 
> walking names etc). (there's even a cookbook item for adding an LRU 
> cache as a natural index for space-constrained items.) Again you'd 
> have to switch over to C++, but it implements exactly what I think you 
> are looking for. the multi-index container declaration syntax gets a 
> little thick but once declared the usage model is really simple. (you 
> pay most-or-all the cost during "update" which is really a 
> remove-then-readd but you won't be updating as often as adding so...)
Yeah, C++ seems much easier to switch, but as the preparation, it is 
better to cleanup the inode_record codes from
cmds-check.c and put them into a single file.
After the cleanup, I think we can even have different backend to test with.
(Only to test, multi-backend support also seems to be overkilled)
>
> My "crazy-ist idea" would be to maybe mmap all the metadata from the 
> filesystem into the process space and build a page-table-like index 
> over that. The virtual memory demand paging will become your caching 
> buddy. mapped_block+offest gets you directly to the whole native 
> structure. You might have to juggle mappings (map and unmap them) on 
> systems with arger filesystems but smaller virtual tables (e.g. 
> classic pentium 32 bit). Even then, you'd be juggling mappings and the 
> kernel would be doing all the real caching. The more data you can 
> leave in the mmapped ranges the less storage management takes place in 
> the local code.
>
> IMHO, finding a USB stick to make a swap space on is probably just as 
> good, if not better than, doing it with a whole other filesystem and 
> the "put my temp files there" option during emergency maintenance.
Sadly, there is already user complaining about the memory usage:
http://comments.gmane.org/gmane.comp.file-systems.btrfs/34573

The 13T fs may contain several hundreds GB of metadata, already takes 
over 8G memory.
It may takes up 10+ of GB memory, so a memory stick may help but when it 
happens on a remote server?
Or can you tolerate the snail like slow IO speed?
So better store them on disk when things go really huge.

Map and unmap will be good,but it may not resolve the problem at all.
The main memory usage in btrfsck is extent record, which
we can't free them until we read them all in and checked, so even we 
mmap/unmap, it can only help with
the extent_buffer(which is already freed if not used according to refs).

Now, I miss the page cache in kernel so much...

Anyway, btrfsck should really consider about the memory usage now.

>
>
> If the mailing list engine honors attachments, you will find 
> FileEntries.h from one of my projects, offered as a simplified example 
> of a directory entry cache implemented in a boost multi-index container.
>
> (Note that this is fully working code from an upcoming addition to my 
> soruceforge project, so it's not just theoretical implementation ideas.)
>
> struct FileEntry {
>   struct NameIndex {};
>   struct DirectoryIndex {};
>   struct INodeIndex {};
>   struct TypedIndex {};
>   dev_t         Device;
>   long          ParentINode;
>   long          INode;
>   std::string   Name;
>   enum FileType Type;
> };
>
> The empty structs inside the struct act as index identifiers/names 
> (Because C++ templates only disambiguate on types so 
> FileEntry::NameIndex is the type-name of the index of file names (etc).
>
> Any other data could be added to FileEntry without disrupting any 
> existing code for indexing or iterating.
>
> The container itself is a little grusome textually ::
>
> typedef multi_index_container<
>   FileEntry,
>   indexed_by<
>     ordered_non_unique<
>       tag<FileEntry::NameIndex>,
>       member<
>         FileEntry,
>         std::string,
>         &FileEntry::Name
>       >
>     >,
>     ordered_non_unique<
>       tag<FileEntry::DirectoryIndex>,
>       composite_key<
>         FileEntry,
>         member<
>           FileEntry,
>           dev_t,
>           &FileEntry::Device
>         >,
>         member<
>           FileEntry,
>           long,
>           &FileEntry::ParentINode
>         >
>       >
>     >,
>     ordered_non_unique<
>       tag<FileEntry::INodeIndex>,
>       composite_key<
>         FileEntry,
>         member<
>           FileEntry,
>           dev_t,
>           &FileEntry::Device
>         >,
>         member<
>           FileEntry,
>           long,
>           &FileEntry::INode
>         >
>       >
>     >,
>     ordered_non_unique<
>       tag<FileEntry::TypedIndex>,
>       composite_key<
>         FileEntry,
>         member<
>           FileEntry,
>           enum FileType,
>           &FileEntry::Type
>         >,
>         member<
>           FileEntry,
>           std::string,
>           &FileEntry::Name
>         >
>       >
>     >
>   >
> > FileSet;
>
>
> But after that it's super easy to use, uh, if you know STL iterator 
> speak anyway. 8-)
>
> Having a file set is easy and typesafe in any type-able context (such 
> as a member of another structure, or a global):
> FileSet result;
>
> Picking yoru index/view has to be done by (admittedly ugly looking) type
> FileSet::index<FileEntry::NameIndex>::type
>
> But then when you make one and plumb it ("result" being the working set
> FileSet::index<FileEntry::NameIndex>::type & by_name = 
> result.get<FileEntry::NameIndex>();
>
> by_name is now a fully functional view of the set as an iteraterable 
> container with elements like by_name.erase(some_name) and such.
>
> It gets you what you say you want from SQL but without all the 
> indirection or duplication of SQL.
>
> (Note that the attached example will fill a FileSet object into memory 
> with the entire (sub) tree of a file system if you call Fill() with 
> the deep selector; or just fill the set with the contents of a single 
> directory when the shallow selector is used.
>
> etc.
>
> So anyway. I've been in this modeling space for a while. I looked at 
> other means of modeling such as SQL and I decided this one was "the 
> best" for what I was doing.
>
> The results are fast, deterministic, and type-safe with no internal 
> typecasts or full table scan results.
Great implement! Almost fits all the needs, except the ability to save 
it on disk...

In fact, I have already an idea to implement it in pure C, although not 
so generic than C++ template, but should
work anyway.

So the remaining problem will be the ability to cache things on disk for 
case like the 13T array.

For C/C++, saving recording to disk will be a simplified version of page 
cache, but will still be a huge chanllenge.
For SQL, it will be much easier but I am consider it as the last try 
now, since the C++ implement seems quite good.

Anyway, the memory usage problem is still a problem to resolve.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-04  6:56           ` Qu Wenruo
@ 2014-12-10 21:57             ` Zygo Blaxell
  2014-12-11  2:05               ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2014-12-10 21:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Robert White, linux-btrfs, David Sterba

[-- Attachment #1: Type: text/plain, Size: 990 bytes --]

On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
> The main memory usage in btrfsck is extent record, which
> we can't free them until we read them all in and checked, so even we
> mmap/unmap, it can only help with
> the extent_buffer(which is already freed if not used according to refs).

I'm thinking aloud here, but is it *really* necessary to read everything
into memory?  Maybe a multiple-pass algorithm might be possible, e.g. one
to find free space by eliminating any areas that are occupied by extents,
then other passes to rebuild the metadata in the free space.  Or, one
pass to verify the connectivity of references and collect dangling refs,
then a second pass which fixes only the dangling refs.

Usually sequential reads are significantly faster than swapping--even
if swapping on solid-state media.  It could be that reading 260GB of
metadata sequentially two or three times is still faster than thrashing
through random lookups in 20GB of swap on a 4GB machine.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-10 21:57             ` Zygo Blaxell
@ 2014-12-11  2:05               ` Qu Wenruo
  2014-12-11  2:27                 ` Zygo Blaxell
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-11  2:05 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Robert White, linux-btrfs, David Sterba


-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Zygo Blaxell <zblaxell@furryterror.org>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2014年12月11日 05:57
> On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
>> The main memory usage in btrfsck is extent record, which
>> we can't free them until we read them all in and checked, so even we
>> mmap/unmap, it can only help with
>> the extent_buffer(which is already freed if not used according to refs).
> I'm thinking aloud here, but is it *really* necessary to read everything
> into memory?
Totally agreed to only read what we need.
But some backref and counts on refs can only be determined after a full 
scan, especially for leaf/node corruption
case.
>    Maybe a multiple-pass algorithm might be possible, e.g. one
> to find free space by eliminating any areas that are occupied by extents,
> then other passes to rebuild the metadata in the free space.  Or, one
> pass to verify the connectivity of references and collect dangling refs,
> then a second pass which fixes only the dangling refs.
I have similar idea, but not multi-pass method, instead, using per 
sector scan + tree search for other data.
E.g in extent tree check, each time only record all extents in a block 
group, and check them.
After check, remove the good extents/block groups and then move to next 
block group.
For fs tree, any key with same objectid(ino) as a group, and only read  
the group in one time and remove
the already known healthy record. (info not fully gathered or bad record 
will still stay in memory)

But I don't consider this method can really save much memory though...
>
> Usually sequential reads are significantly faster than swapping--even
> if swapping on solid-state media.  It could be that reading 260GB of
> metadata sequentially two or three times is still faster than thrashing
> through random lookups in 20GB of swap on a 4GB machine.
>
Definitely, but if we want to reduce memory usage, it is almost 
unavoidable to do more disk IO, especially random
disk IO, so it will become a tradeoff, which may cause the already slow 
fsck more slow....

Thanks,
Qu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-11  2:05               ` Qu Wenruo
@ 2014-12-11  2:27                 ` Zygo Blaxell
  0 siblings, 0 replies; 18+ messages in thread
From: Zygo Blaxell @ 2014-12-11  2:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Robert White, linux-btrfs, David Sterba

[-- Attachment #1: Type: text/plain, Size: 1489 bytes --]

On Thu, Dec 11, 2014 at 10:05:20AM +0800, Qu Wenruo wrote:
> 
> -------- Original Message --------
> Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
> From: Zygo Blaxell <zblaxell@furryterror.org>
> To: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Date: 2014年12月11日 05:57
> >On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
> >>The main memory usage in btrfsck is extent record, which
> >>we can't free them until we read them all in and checked, so even we
> >>mmap/unmap, it can only help with
> >>the extent_buffer(which is already freed if not used according to refs).
> >I'm thinking aloud here, but is it *really* necessary to read everything
> >into memory?
> Totally agreed to only read what we need.
> But some backref and counts on refs can only be determined after a
> full scan, especially for leaf/node corruption
> case.

It might be faster (and smaller) to pipe them out to sort (with gzip/lzma
compression on temporary files) than to try to insert them in a tree.
I have used that technique in some of my deduplicating programs.  It can
cut the working set size by several orders of magnitude (trading it for
an O(n log n) sort, which will mostly read and write sequentially).

e.g. duplicate refs will all sort together, so when you are sequentially
reading the sorted data and the current key value changes, you know you've
seen everything that could be a duplicate, and can discard everything
in RAM.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-02  0:37   ` Qu Wenruo
@ 2014-12-11 19:00     ` Martin Steigerwald
  0 siblings, 0 replies; 18+ messages in thread
From: Martin Steigerwald @ 2014-12-11 19:00 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Austin S Hemmelgarn, linux-btrfs

Hi Qu,

Am Dienstag, 2. Dezember 2014, 08:37:44 schrieb Qu Wenruo:
> -------- Original Message --------
> Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
> From: Austin S Hemmelgarn <ahferroin7@gmail.com>
> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs
> <linux-btrfs@vger.kernel.org>
> Date: 2014年12月01日 20:53
> 
> > On 2014-11-30 20:58, Qu Wenruo wrote:
> >> [snipped]
> > 
> > So, I think this does a good job of highlighting one of the bigger
> > issues with btrfsck when it is compared to ext* and/or xfs. Despite
> > this being a problem, I really don't think using a rdbms is the way to
> > fix it, both for reasons outlined in other responses, and because fsck
> > should be as fast as possible when nothing is wrong with the fs.
> 
> Although I am not stick to the crazy idea, I think it is still needed to
> point out that,
> even btrfsck is ran on a clean btrfs, it still needs to iterate all the
> extents, metadata.
> So it may not be as fast as you thought even with the current implement.
> 
> Anyway thanks for the feedback.

As to my knowledge at some time xfs_repair struggled with memory issues as 
well. And the developers of it made an attempt at reducing memory usage. I 
think with e2fsck this has been an issue as well. So maybe might be an idea at 
the approaches the devs of these two filesystem checking tools used to reduce 
memory usage.

On the other hand it may be that XFS and Ext4 is just too different to BTRFS to 
adapt those approaches. But in the end they are all filesystems, so there may 
be enough similarities to port over some ideas.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
  2014-12-01  1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
                   ` (2 preceding siblings ...)
  2014-12-01 12:53 ` Austin S Hemmelgarn
@ 2014-12-11 19:38 ` Roger Binns
  3 siblings, 0 replies; 18+ messages in thread
From: Roger Binns @ 2014-12-11 19:38 UTC (permalink / raw)
  To: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/30/2014 05:58 PM, Qu Wenruo wrote:
> 2. Heavy dependency If use it, btrfs-progs will include RDBMS as
> the make and runtime dependency. Such low level progs depend on
> high level programs like sqlite3 may be very strange.

BTW SQLite is designed as a library.  It is shipped as a single file
with the deliberate intention you add the sqlite3.c file to your
project.  For private internal tool use you don't need to depend or
use the system SQLite in any way.

  https://www.sqlite.org/amalgamation.html

SQLite also lets you easily define collation sequences, your own
functions and virtual tables which will make all this easier.  Using
pragmas you can control use of memory and disk space for operation.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iEYEARECAAYFAlSJ8rAACgkQmOOfHg372QRzJACgoip0vhoM0XEkVIB9/ZggPXX1
PuMAn3/0lP+SQyxDh6UFStt5hlA2Wwkz
=b5GF
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-12-11 19:38 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-01  1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
2014-12-01  3:08 ` Duncan
2014-12-01  3:24   ` Qu Wenruo
2014-12-01  5:47     ` Duncan
2014-12-01  6:25       ` Qu Wenruo
2014-12-01  4:03 ` Robert White
2014-12-01  6:18   ` Qu Wenruo
2014-12-01 18:10     ` Robert White
2014-12-02  1:17       ` Qu Wenruo
2014-12-03 19:18         ` Robert White
2014-12-04  6:56           ` Qu Wenruo
2014-12-10 21:57             ` Zygo Blaxell
2014-12-11  2:05               ` Qu Wenruo
2014-12-11  2:27                 ` Zygo Blaxell
2014-12-01 12:53 ` Austin S Hemmelgarn
2014-12-02  0:37   ` Qu Wenruo
2014-12-11 19:00     ` Martin Steigerwald
2014-12-11 19:38 ` Roger Binns

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox