* Crazy idea of cleanup the inode_record btrfsck things with SQL?
@ 2014-12-01 1:58 Qu Wenruo
2014-12-01 3:08 ` Duncan
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Qu Wenruo @ 2014-12-01 1:58 UTC (permalink / raw)
To: linux-btrfs; +Cc: David Sterba
[BACKGROUND]
I'm trying to implement the function to repair missing inode item.
Under that case, inode type must be salvaged(although it can be fallback to
FILE).
One case should be, if there is any dir_item/index or inode_ref refers the
inode as parent, the type of that inode must be DIR.
However, currently btrfsck implement (inode_record only records backref), we
are unable to search the inode_backref whose parent is given inode number.
[FIRST IMPLEMENT DESIGN]
My first thought is to implement an generic inode-relation structure,
recording parent ino, child ino, name and namelen, and restore the structure
in a rbtree, not in the child/parent's list.
But I soon recognize that this is a perfect use case for relational
database,
as 'ino' as the primary key for INODE table,
('parent_ino', 'child_ino', 'name') as the primary key for INODE_REF table.
[CRAZY IDEA]
So why not using SQL to implement the btrfsck inode-record things?
With such crazy idea, it will be much much easier to do any iteration from a
given ino, and with the already mature RDB implement, like sqlite3, we can
save hundreds of lines of codes implementing the rb-tree or list.
[PROS]
1. Easy to maintain
Now we don't need to maintain the rbtree searching or list
iteration, but
easy SQL lines and its wrapper.
2. Easy to extend
If we need to record something more, like extents and its relation to
inode, we only need to create 2 tables and several SQL and wrappers.
3. Reduced memory usage for HUGE fs.
When metadata grows to several TB or even more, current rb-tree based
implement may run short of memory since they are all stored in memory.
But if use SQL, RDBMS like sqlite3 can restore things in either
memory or
disk, which may hugely reduce the memory usage for huge btrfs.
If not use existing RDBMS, we need to implement complicated memory
control
system to manage memory in userland.
[CONS]
1. Heavy implement
SQL hide the rb-tree or B+ tree implement but costs more memory(if not
compressed) and CPU cycles, which will be slower than the simple rb-tree
implement even using lightweight RDBMS like sqlite3.
2. Heavy dependency
If use it, btrfs-progs will include RDBMS as the make and runtime
dependency.
Such low level progs depend on high level programs like sqlite3 may
be very
strange.
3. A lot of rework on existing codes.
Even SQL is easier to maintain and extend, if we use it, we still
need to
reimplement several hundreds or even thousands lines of code to
implement
it, not to mention the regression tests.
4. Copyright
Will it cause any copyright problem if using non-GPL RDBMS like
sqlite3 in
GPLv2 btrfs-progs?
[NEED FEEDBACK]
Any feedback or discussion on the crazy idea is welcomed, since this may
needs
a lot of work, it definitely needs a lot review on the idea before it
comes to
codes.
Thanks,
Qu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
@ 2014-12-01 3:08 ` Duncan
2014-12-01 3:24 ` Qu Wenruo
2014-12-01 4:03 ` Robert White
` (2 subsequent siblings)
3 siblings, 1 reply; 18+ messages in thread
From: Duncan @ 2014-12-01 3:08 UTC (permalink / raw)
To: linux-btrfs
Qu Wenruo posted on Mon, 01 Dec 2014 09:58:27 +0800 as excerpted:
> [CRAZY IDEA]
> So why not using SQL to implement the btrfsck inode-record things?
> 2. Heavy dependency
> If use it, btrfs-progs will include RDBMS as the make and runtime
> dependency. Such low level progs depend on high level programs
> like sqlite3 may be very strange.
I expect this will turn many of the "traditionalists" off, at least. I
could see a lot of traditional sysadmins lumping btrfs in with systemd if
it started requiring a db, much as one of the big objections to systemd
is the dbus requirement... even for headless servers that have never
required it before. Of course they could be ignored, but do we really
want to go there?
(Personally, my gut reaction is "eew", and of course getting database
file handling correct after an ungraceful shutdown/reboot is one of the
big challenges for a filesystem as it is, so I'm not entirely sure
storing information in a database file in ordered to use it to help fix
the filesystem is a good idea since it could well be that you end up
needing an fsck to restore the file... to do the fsck, but I could be
convinced. I'm worried about the ones that can't be.)
> 4. Copyright
> Will it cause any copyright problem if using non-GPL RDBMS like
> sqlite3 in GPLv2 btrfs-progs?
I just checked and at least on gentoo, sqlite's license is registered as
"public domain", which is legally mergeable with code under any other
license free or proprietary, so there should be no problem with it. If
something else is used of course it would depend on its license.
I believe the general kernel-rules practice for such compatible license
merging is to keep code under compatible licenses in separate files and
keep the individual files under their individual licenses. While if it's
compatible I don't believe that's generally an actual legal requirement,
the BSD folks in particular tend to be /very/ sensitive about code
formerly under the BSD license, for instance, merged directly into GPL
headlined files, because in that case they can't reverse the process.
Personally I don't see the big deal, since they seem to have /no/ problem
with their code being taken proprietary where they can't even /look/ at
it, but a /huge/ problem with it being taken GPL, where they may not be
able to directly copy back to BSD, but they obviously have the code
available to look at anyway and still mergeable provided they dual
license, and that makes absolutely no sense to me. <shrug>
So while I don't think it'll go over well, there should be no license
issues at least with sqlite.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 3:08 ` Duncan
@ 2014-12-01 3:24 ` Qu Wenruo
2014-12-01 5:47 ` Duncan
0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-01 3:24 UTC (permalink / raw)
To: Duncan, linux-btrfs
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Duncan <1i5t5.duncan@cox.net>
To: <linux-btrfs@vger.kernel.org>
Date: 2014年12月01日 11:08
> Qu Wenruo posted on Mon, 01 Dec 2014 09:58:27 +0800 as excerpted:
>
>> [CRAZY IDEA]
>> So why not using SQL to implement the btrfsck inode-record things?
>> 2. Heavy dependency
>> If use it, btrfs-progs will include RDBMS as the make and runtime
>> dependency. Such low level progs depend on high level programs
>> like sqlite3 may be very strange.
> I expect this will turn many of the "traditionalists" off, at least. I
> could see a lot of traditional sysadmins lumping btrfs in with systemd if
> it started requiring a db, much as one of the big objections to systemd
> is the dbus requirement... even for headless servers that have never
> required it before. Of course they could be ignored, but do we really
> want to go there?
Oh,so terrible the systemd warfare. :(
This objection sounds very solid now.
Anyway, this is a crazy idea...
(Maybe it is only me so lazy to implement the rb-tree based things)
>
> (Personally, my gut reaction is "eew", and of course getting database
> file handling correct after an ungraceful shutdown/reboot is one of the
> big challenges for a filesystem as it is, so I'm not entirely sure
> storing information in a database file in ordered to use it to help fix
> the filesystem is a good idea since it could well be that you end up
> needing an fsck to restore the file... to do the fsck, but I could be
> convinced. I'm worried about the ones that can't be.)
The db file is mostly used in memory, only when the metadata is really
really big, maybe when the fs tree's level
is 7 or 8 we may need to use db file.
And the db file should be anonymous(unlinked but open) since it is only
used in one btrfsck session,
not reused or really needed to be stored.
So I will not be a problem of restore or something like that.
>
>> 4. Copyright
>> Will it cause any copyright problem if using non-GPL RDBMS like
>> sqlite3 in GPLv2 btrfs-progs?
> I just checked and at least on gentoo, sqlite's license is registered as
> "public domain", which is legally mergeable with code under any other
> license free or proprietary, so there should be no problem with it. If
> something else is used of course it would depend on its license.
>
> I believe the general kernel-rules practice for such compatible license
> merging is to keep code under compatible licenses in separate files and
> keep the individual files under their individual licenses. While if it's
> compatible I don't believe that's generally an actual legal requirement,
> the BSD folks in particular tend to be /very/ sensitive about code
> formerly under the BSD license, for instance, merged directly into GPL
> headlined files, because in that case they can't reverse the process.
> Personally I don't see the big deal, since they seem to have /no/ problem
> with their code being taken proprietary where they can't even /look/ at
> it, but a /huge/ problem with it being taken GPL, where they may not be
> able to directly copy back to BSD, but they obviously have the code
> available to look at anyway and still mergeable provided they dual
> license, and that makes absolutely no sense to me. <shrug>
>
>
> So while I don't think it'll go over well, there should be no license
> issues at least with sqlite.
>
Thanks for the license check anyway.
Thanks,
Qu.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
2014-12-01 3:08 ` Duncan
@ 2014-12-01 4:03 ` Robert White
2014-12-01 6:18 ` Qu Wenruo
2014-12-01 12:53 ` Austin S Hemmelgarn
2014-12-11 19:38 ` Roger Binns
3 siblings, 1 reply; 18+ messages in thread
From: Robert White @ 2014-12-01 4:03 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: David Sterba
On 11/30/2014 05:58 PM, Qu Wenruo wrote:
> ("why not use SQL to..." suggestion)
SQL, as in Structured Query Language, is _terrible_ for recursion. It
expresses all of its elements in terms of set theory and really can only
implement union and intersection of flat sets.
Several companies offer extensions to SQL in their implementations to
help with this lack of recursion such as "prior" in Oracle's PSQL, but
they are all stateful beyond reason.
Several companies, including microsoft, have proposed and partially
implemented "a relational database as a file system" paradigm and then
crashed into the fact that dealing with the parent of the parent of
something is different than dealing with the parent of the parent of the
parent of something.
There is a humours-but-true saying: "If you have a problme, and you
decide to solve it with (regex or xml or uml or sql etc) you now have
two problems."
Writing the SQL to walk the tree is harder than allocating the memory as
a vector, filling it with the data, and then walking the pointers.
Your suggestion is the first step on the road to The Inner Platform
Effect™. You have a specialized database (parent, inode, name) and now
you want to put a generic database engine over the specialized database
so that you an re-implement the specialized database with generic
primitives.
http://en.wikipedia.org/wiki/Inner-platform_effect
Things need to be only as generic as they need to be, and no more
generic than that.
Replacing a pointer to a record with a pointer to a cursor's result
table that will give you the name of the next result to query is not a
win. Even as you spell it out you can see that it is _not_ a reduction
in memory or processing.
And the "easy SQL lines" stop being that easy when "name" stops being
unique.
(I've been down this road before. Not with file systems but with
"managed objects" in a network management system. Nodes, Parent nodes,
etc. Just referring to distributed things like networks switches instead
of file system inodes. ... It doesn't end well. 8-) )
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 3:24 ` Qu Wenruo
@ 2014-12-01 5:47 ` Duncan
2014-12-01 6:25 ` Qu Wenruo
0 siblings, 1 reply; 18+ messages in thread
From: Duncan @ 2014-12-01 5:47 UTC (permalink / raw)
To: linux-btrfs
Qu Wenruo posted on Mon, 01 Dec 2014 11:24:50 +0800 as excerpted:
> The db file is mostly used in memory, only when the metadata is really
> really big, maybe when the fs tree's level is 7 or 8 we may need to use
> db file.
So fscking the database in ordered to fsck the database isn't an issue.
One objection down! =:^)
But seriously, the politics of the idea remains its biggest nemesis in my
opinion. And in systemd we've unfortunately a live demonstration of just
how big a nemesis that can be. =:^( If the technical reasoning for it
is sound and the benefit high enough, great, but IMO the benefit will
need to be pretty high to justify the risk of political fallout, and I
doubt it's anything close to that high. But it's not my call, so we'll
see. Thinks could certainly get interesting if it's judged to be worth
it. < Checking popcorn stash >
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 4:03 ` Robert White
@ 2014-12-01 6:18 ` Qu Wenruo
2014-12-01 18:10 ` Robert White
0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-01 6:18 UTC (permalink / raw)
To: Robert White, linux-btrfs; +Cc: David Sterba
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White <rwhite@pobox.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs
<linux-btrfs@vger.kernel.org>
Date: 2014年12月01日 12:03
> On 11/30/2014 05:58 PM, Qu Wenruo wrote:
>> ("why not use SQL to..." suggestion)
>
> SQL, as in Structured Query Language, is _terrible_ for recursion. It
> expresses all of its elements in terms of set theory and really can
> only implement union and intersection of flat sets.
>
> Several companies offer extensions to SQL in their implementations to
> help with this lack of recursion such as "prior" in Oracle's PSQL, but
> they are all stateful beyond reason.
>
> Several companies, including microsoft, have proposed and partially
> implemented "a relational database as a file system" paradigm and then
> crashed into the fact that dealing with the parent of the parent of
> something is different than dealing with the parent of the parent of
> the parent of something.
>
> There is a humours-but-true saying: "If you have a problme, and you
> decide to solve it with (regex or xml or uml or sql etc) you now have
> two problems."
Wait, regex and uml and xml is OK, but never heard sql is one of them...
>
> Writing the SQL to walk the tree is harder than allocating the memory
> as a vector, filling it with the data, and then walking the pointers.
In fact, such INODE and INODE_REF table is not (completely nor mainly)
used to walk the tree,
it is mainly used to search for:
1. is there any inode_ref refers to a given ino as parent.
This will not even be a problem when the fs is *OK*, since do a simple
btrfs_search_slot()
with key( objectied = ino, type = BTRFS_DIR_INDEX/ITEM_KEY, offset = 0)
will do it.
However when it comes to corrupted leaf, the whole INODE_ITEM with its
DIR_INDEX/ITEM are gone
with the leaf, so the old search way is not usable and btrfs-progs will
relay on other mechanism
to determine that.
And unfortunately, there is no such mechanism.
2. is there any dir_index/dir_item refers to a given ino as child.
Current inode_record works fine for this object.
So when the crazy idea disappear and sane ideas come back, it will
probably be rb-tree based
(parent, ino, name, namelen) entries to record parent-child relation
(currently it is a list_head only records backref inside the inode_record).
And another rb-tree based (ino) entries (same as current inode_record
structure).
>
> Your suggestion is the first step on the road to The Inner Platform
> Effect™. You have a specialized database (parent, inode, name) and now
> you want to put a generic database engine over the specialized
> database so that you an re-implement the specialized database with
> generic primitives.
>
> http://en.wikipedia.org/wiki/Inner-platform_effect
>
> Things need to be only as generic as they need to be, and no more
> generic than that.
>
> Replacing a pointer to a record with a pointer to a cursor's result
> table that will give you the name of the next result to query is not a
> win. Even as you spell it out you can see that it is _not_ a reduction
> in memory or processing.
>
> And the "easy SQL lines" stop being that easy when "name" stops being
> unique.
Name is still unique when parent ino is given, so the INODE_REF tables'
primary key is not
name but the (parent, ino, name) combine.
But the inner platform effect still seems valid for my crazy idea.
Anyway, the crazy idea comes to me when I see the RDB like feature in
the inode_record structure,
-and I just want to save sometime coding the new (parent, ino, name,
namelen) rb-tree-.
>
> (I've been down this road before. Not with file systems but with
> "managed objects" in a network management system. Nodes, Parent nodes,
> etc. Just referring to distributed things like networks switches
> instead of file system inodes. ... It doesn't end well. 8-) )
>
The RDB idea must come to you just like me, wanting to write less codes,
right?
So it seems the end may be the same. :-(
Thanks,
Qu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 5:47 ` Duncan
@ 2014-12-01 6:25 ` Qu Wenruo
0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2014-12-01 6:25 UTC (permalink / raw)
To: Duncan, linux-btrfs
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Duncan <1i5t5.duncan@cox.net>
To: <linux-btrfs@vger.kernel.org>
Date: 2014年12月01日 13:47
> Qu Wenruo posted on Mon, 01 Dec 2014 11:24:50 +0800 as excerpted:
>
>> The db file is mostly used in memory, only when the metadata is really
>> really big, maybe when the fs tree's level is 7 or 8 we may need to use
>> db file.
> So fscking the database in ordered to fsck the database isn't an issue.
> One objection down! =:^)
>
> But seriously, the politics of the idea remains its biggest nemesis in my
> opinion. And in systemd we've unfortunately a live demonstration of just
> how big a nemesis that can be. =:^( If the technical reasoning for it
> is sound and the benefit high enough, great, but IMO the benefit will
> need to be pretty high to justify the risk of political fallout, and I
> doubt it's anything close to that high. But it's not my call, so we'll
> see. Thinks could certainly get interesting if it's judged to be worth
> it. < Checking popcorn stash >
>
No (systemd civilian) war, make love!
I seldom consider the politics problem of the insane idea even the
systemd warfare is still here.
(Emmm, maybe it is because Arch accept systemd too long time ago without
too much problem?)
Anyway, it is just an insane idea and any feedback killing the seed of
insanity as soon as possible is welcomed.
Thanks,
Qu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
2014-12-01 3:08 ` Duncan
2014-12-01 4:03 ` Robert White
@ 2014-12-01 12:53 ` Austin S Hemmelgarn
2014-12-02 0:37 ` Qu Wenruo
2014-12-11 19:38 ` Roger Binns
3 siblings, 1 reply; 18+ messages in thread
From: Austin S Hemmelgarn @ 2014-12-01 12:53 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 3533 bytes --]
On 2014-11-30 20:58, Qu Wenruo wrote:
> [BACKGROUND]
> I'm trying to implement the function to repair missing inode item.
> Under that case, inode type must be salvaged(although it can be fallback to
> FILE).
>
> One case should be, if there is any dir_item/index or inode_ref refers the
> inode as parent, the type of that inode must be DIR.
>
> However, currently btrfsck implement (inode_record only records
> backref), we
> are unable to search the inode_backref whose parent is given inode number.
>
> [FIRST IMPLEMENT DESIGN]
> My first thought is to implement an generic inode-relation structure,
> recording parent ino, child ino, name and namelen, and restore the
> structure
> in a rbtree, not in the child/parent's list.
>
> But I soon recognize that this is a perfect use case for relational
> database,
> as 'ino' as the primary key for INODE table,
> ('parent_ino', 'child_ino', 'name') as the primary key for INODE_REF table.
>
> [CRAZY IDEA]
> So why not using SQL to implement the btrfsck inode-record things?
>
> With such crazy idea, it will be much much easier to do any iteration
> from a
> given ino, and with the already mature RDB implement, like sqlite3, we can
> save hundreds of lines of codes implementing the rb-tree or list.
>
> [PROS]
> 1. Easy to maintain
> Now we don't need to maintain the rbtree searching or list
> iteration, but
> easy SQL lines and its wrapper.
>
> 2. Easy to extend
> If we need to record something more, like extents and its relation to
> inode, we only need to create 2 tables and several SQL and wrappers.
>
> 3. Reduced memory usage for HUGE fs.
> When metadata grows to several TB or even more, current rb-tree based
> implement may run short of memory since they are all stored in memory.
> But if use SQL, RDBMS like sqlite3 can restore things in either
> memory or
> disk, which may hugely reduce the memory usage for huge btrfs.
>
> If not use existing RDBMS, we need to implement complicated memory
> control
> system to manage memory in userland.
>
> [CONS]
> 1. Heavy implement
> SQL hide the rb-tree or B+ tree implement but costs more memory(if not
> compressed) and CPU cycles, which will be slower than the simple
> rb-tree
> implement even using lightweight RDBMS like sqlite3.
>
> 2. Heavy dependency
> If use it, btrfs-progs will include RDBMS as the make and runtime
> dependency.
> Such low level progs depend on high level programs like sqlite3 may
> be very
> strange.
>
> 3. A lot of rework on existing codes.
> Even SQL is easier to maintain and extend, if we use it, we still
> need to
> reimplement several hundreds or even thousands lines of code to
> implement
> it, not to mention the regression tests.
>
> 4. Copyright
> Will it cause any copyright problem if using non-GPL RDBMS like
> sqlite3 in
> GPLv2 btrfs-progs?
>
> [NEED FEEDBACK]
> Any feedback or discussion on the crazy idea is welcomed, since this may
> needs
> a lot of work, it definitely needs a lot review on the idea before it
> comes to
> codes.
>
So, I think this does a good job of highlighting one of the bigger
issues with btrfsck when it is compared to ext* and/or xfs. Despite
this being a problem, I really don't think using a rdbms is the way to
fix it, both for reasons outlined in other responses, and because fsck
should be as fast as possible when nothing is wrong with the fs.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 6:18 ` Qu Wenruo
@ 2014-12-01 18:10 ` Robert White
2014-12-02 1:17 ` Qu Wenruo
0 siblings, 1 reply; 18+ messages in thread
From: Robert White @ 2014-12-01 18:10 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: David Sterba
On 11/30/2014 10:18 PM, Qu Wenruo wrote:
> (advocacy for using SQL internally for btrfsck)
All of these ideas you want to toss a entire SQL front end on are more
simply handled with simple data structures.
In C++ terms "map<inode,parent>" and/or "map<parent,vector<children>>"
beats the heck out of including all of SQL and its related indexes and
type conversions (sqlite, for example, stores integers as doubles, or
decimal numbers depending on version).
RDBMS _are_ good at representing things, so noticing that a thing _can_
be represented with an RDBMS is very common.
But by the time you put two or three indexes on relation->(parent,
child, name) you've given yourself three or four copies of the core data
in three or four different places. And those copies are largely
immutable and randomly distributed and will include the overhead in
memory for fairly sparse trees.
It's not that it's an unworkable idea.
But it is unnecessarily generic and adds an order of magnitude of
complexity to your problems.
For instance, if I boot from a CD to run a btrfsck where will the
database files be written to?
If it is an in-memory table why do I want the overhead of SQL to look up
something indexed by integer?
If the sparse vectors of integers don't fit in memory why would the SQL
tables of integers fit "better"?
SQL would be the second slowest possible for representing this data --
The slowest would be an XML schema stored as flat text.
So your crazy ides is also a pretty bad one compared to most if not all
sparse data representations and techniques that come to bear on this
problem set. All you are really doing is pushing the same work (walking
a tree to find an integer) into a difficult "spell it out in SQL" space.
Is prepare_sql(curosr,"SELECT parent FROM parantage_tree WHERE child =
%d"); execute_sql(cursor,child); and its possible error returns actually
clearer or better than "parent=inheretance.find(child); if
(parent!=inheretance.end()) {...}" (as it might be written in C++)?
Do you want to know if (keep track of whether) an inode is allocated and
referenced? There's a sparse bit-vector for that...
Want to be able to get back to an inode's location on disk, a sparse
array of disk offsets exists (among other options).
Before you can even access the RDBMS you'd have to fill it completely;
otherwise you wouldn't know if a select returning zero rows was an
authoritative indication that the datum didn't exist or if it was
instead an indication that the datum hadn't been populated yet.
THIS IS NOT SARCASM: If you strongly disagree, I suggest you start
coding. Seriously, don't ask, do... And in a month really check to see
if your solution is any smaller, faster, easier, or in _any_ _way_ more
optimal than using native data structures. The attempt will answer the
question definitively and then we'll all know...
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 12:53 ` Austin S Hemmelgarn
@ 2014-12-02 0:37 ` Qu Wenruo
2014-12-11 19:00 ` Martin Steigerwald
0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-02 0:37 UTC (permalink / raw)
To: Austin S Hemmelgarn, linux-btrfs
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs
<linux-btrfs@vger.kernel.org>
Date: 2014年12月01日 20:53
> On 2014-11-30 20:58, Qu Wenruo wrote:
>> [snipped]
>>
> So, I think this does a good job of highlighting one of the bigger
> issues with btrfsck when it is compared to ext* and/or xfs. Despite
> this being a problem, I really don't think using a rdbms is the way to
> fix it, both for reasons outlined in other responses, and because fsck
> should be as fast as possible when nothing is wrong with the fs.
>
>
Although I am not stick to the crazy idea, I think it is still needed to
point out that,
even btrfsck is ran on a clean btrfs, it still needs to iterate all the
extents, metadata.
So it may not be as fast as you thought even with the current implement.
Anyway thanks for the feedback.
Thanks,
Qu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 18:10 ` Robert White
@ 2014-12-02 1:17 ` Qu Wenruo
2014-12-03 19:18 ` Robert White
0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-02 1:17 UTC (permalink / raw)
To: Robert White, linux-btrfs; +Cc: David Sterba
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White <rwhite@pobox.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs
<linux-btrfs@vger.kernel.org>
Date: 2014年12月02日 02:10
> On 11/30/2014 10:18 PM, Qu Wenruo wrote:
>> (advocacy for using SQL internally for btrfsck)
>
> All of these ideas you want to toss a entire SQL front end on are more
> simply handled with simple data structures.
>
> In C++ terms "map<inode,parent>" and/or "map<parent,vector<children>>"
> beats the heck out of including all of SQL and its related indexes and
> type conversions (sqlite, for example, stores integers as doubles, or
> decimal numbers depending on version).
>
> RDBMS _are_ good at representing things, so noticing that a thing
> _can_ be represented with an RDBMS is very common.
>
> But by the time you put two or three indexes on relation->(parent,
> child, name) you've given yourself three or four copies of the core
> data in three or four different places. And those copies are largely
> immutable and randomly distributed and will include the overhead in
> memory for fairly sparse trees.
>
> It's not that it's an unworkable idea.
>
> But it is unnecessarily generic and adds an order of magnitude of
> complexity to your problems.
>
> For instance, if I boot from a CD to run a btrfsck where will the
> database files be written to?
This is easy, memory.
Since only when we judge the fs' metadata is too huge then we will use file.
One of the problem in current inode_record is, btrfsck can only record
them all in memory,
when metadata of the file system is too big, sysadmin can only add swap
space or memory
to handle it.
Although it is not a urgent problem, since 1T btrfs fs with about 5G
metadata will only takes about 500M
checking chunk and extent and even less for checking fs roots.
>
> If it is an in-memory table why do I want the overhead of SQL to look
> up something indexed by integer?
>
> If the sparse vectors of integers don't fit in memory why would the
> SQL tables of integers fit "better"?
>
> SQL would be the second slowest possible for representing this data --
> The slowest would be an XML schema stored as flat text.
>
> So your crazy ides is also a pretty bad one compared to most if not
> all sparse data representations and techniques that come to bear on
> this problem set. All you are really doing is pushing the same work
> (walking a tree to find an integer) into a difficult "spell it out in
> SQL" space.
>
> Is prepare_sql(curosr,"SELECT parent FROM parantage_tree WHERE child =
> %d"); execute_sql(cursor,child); and its possible error returns
> actually clearer or better than "parent=inheretance.find(child); if
> (parent!=inheretance.end()) {...}" (as it might be written in C++)?
>
> Do you want to know if (keep track of whether) an inode is allocated
> and referenced? There's a sparse bit-vector for that...
>
> Want to be able to get back to an inode's location on disk, a sparse
> array of disk offsets exists (among other options).
>
> Before you can even access the RDBMS you'd have to fill it completely;
> otherwise you wouldn't know if a select returning zero rows was an
> authoritative indication that the datum didn't exist or if it was
> instead an indication that the datum hadn't been populated yet.
>
> THIS IS NOT SARCASM: If you strongly disagree, I suggest you start
> coding. Seriously, don't ask, do... And in a month really check to see
> if your solution is any smaller, faster, easier, or in _any_ _way_
> more optimal than using native data structures. The attempt will
> answer the question definitively and then we'll all know...
I know this is a crazy idea and not disagree with your opinion.
But I am also somewhat tired of bringing new structure new searching
functions or even bring larger change on
the btrfsck record infrastructure when I found that can't provide the
function when new recovery function is going
to be implemented.
In fact, after I implement the whole corrupted-leaf recovery patchset, I
may try to implement it as an experimental
try-and-error for cleanup/enhance for the inode_record infrastructure
and see if there is the huge performance drop
or the lines of code reduced(anyway, just a personal try-and-error, will
not send them if there is no such interesting
result, and it may be highly possible a disaster as you mentioned)
Thanks,
Qu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-02 1:17 ` Qu Wenruo
@ 2014-12-03 19:18 ` Robert White
2014-12-04 6:56 ` Qu Wenruo
0 siblings, 1 reply; 18+ messages in thread
From: Robert White @ 2014-12-03 19:18 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs; +Cc: David Sterba
[-- Attachment #1: Type: text/plain, Size: 6077 bytes --]
On 12/01/2014 05:17 PM, Qu Wenruo wrote:
> But I am also somewhat tired of bringing new structure new searching
> functions or even bring larger change on
> the btrfsck record infrastructure when I found that can't provide the
> function when new recovery function is going
> to be implemented.
DISCLAIMER: My actual thing is modeling, so coming from the idea that
btrfsck is basically building a model of the filesystem and comparing
the theoretically correct model to the actual filesystem as found... and
having spent nearly zero time inside of the btrfsck code base itself...
I've looked at what I _think_ you are trying to do with the
lookup/correlate phase finding peer/parent/child references and I'd
think other things would make sense far more sense than SQL.
C++ Standard Template Library at a minimum, life is too short to
reinvent data lookups in memory. But its also too short to re-factor SQL
queries or wait for degenerate full-table scans.
The boost "multi-index container", if you are going to bring in an
external library, would do you far better for dealing with future
structural changes. It would directly index the fields of your
(parent,child,name) relation and let you switch iteration on-the-fly
(e.g. you can be walking parents and use the same iterator to start
walking names etc). (there's even a cookbook item for adding an LRU
cache as a natural index for space-constrained items.) Again you'd have
to switch over to C++, but it implements exactly what I think you are
looking for. the multi-index container declaration syntax gets a little
thick but once declared the usage model is really simple. (you pay
most-or-all the cost during "update" which is really a remove-then-readd
but you won't be updating as often as adding so...)
My "crazy-ist idea" would be to maybe mmap all the metadata from the
filesystem into the process space and build a page-table-like index over
that. The virtual memory demand paging will become your caching buddy.
mapped_block+offest gets you directly to the whole native structure. You
might have to juggle mappings (map and unmap them) on systems with arger
filesystems but smaller virtual tables (e.g. classic pentium 32 bit).
Even then, you'd be juggling mappings and the kernel would be doing all
the real caching. The more data you can leave in the mmapped ranges the
less storage management takes place in the local code.
IMHO, finding a USB stick to make a swap space on is probably just as
good, if not better than, doing it with a whole other filesystem and the
"put my temp files there" option during emergency maintenance.
If the mailing list engine honors attachments, you will find
FileEntries.h from one of my projects, offered as a simplified example
of a directory entry cache implemented in a boost multi-index container.
(Note that this is fully working code from an upcoming addition to my
soruceforge project, so it's not just theoretical implementation ideas.)
struct FileEntry {
struct NameIndex {};
struct DirectoryIndex {};
struct INodeIndex {};
struct TypedIndex {};
dev_t Device;
long ParentINode;
long INode;
std::string Name;
enum FileType Type;
};
The empty structs inside the struct act as index identifiers/names
(Because C++ templates only disambiguate on types so
FileEntry::NameIndex is the type-name of the index of file names (etc).
Any other data could be added to FileEntry without disrupting any
existing code for indexing or iterating.
The container itself is a little grusome textually ::
typedef multi_index_container<
FileEntry,
indexed_by<
ordered_non_unique<
tag<FileEntry::NameIndex>,
member<
FileEntry,
std::string,
&FileEntry::Name
>
>,
ordered_non_unique<
tag<FileEntry::DirectoryIndex>,
composite_key<
FileEntry,
member<
FileEntry,
dev_t,
&FileEntry::Device
>,
member<
FileEntry,
long,
&FileEntry::ParentINode
>
>
>,
ordered_non_unique<
tag<FileEntry::INodeIndex>,
composite_key<
FileEntry,
member<
FileEntry,
dev_t,
&FileEntry::Device
>,
member<
FileEntry,
long,
&FileEntry::INode
>
>
>,
ordered_non_unique<
tag<FileEntry::TypedIndex>,
composite_key<
FileEntry,
member<
FileEntry,
enum FileType,
&FileEntry::Type
>,
member<
FileEntry,
std::string,
&FileEntry::Name
>
>
>
>
> FileSet;
But after that it's super easy to use, uh, if you know STL iterator
speak anyway. 8-)
Having a file set is easy and typesafe in any type-able context (such as
a member of another structure, or a global):
FileSet result;
Picking yoru index/view has to be done by (admittedly ugly looking) type
FileSet::index<FileEntry::NameIndex>::type
But then when you make one and plumb it ("result" being the working set
FileSet::index<FileEntry::NameIndex>::type & by_name =
result.get<FileEntry::NameIndex>();
by_name is now a fully functional view of the set as an iteraterable
container with elements like by_name.erase(some_name) and such.
It gets you what you say you want from SQL but without all the
indirection or duplication of SQL.
(Note that the attached example will fill a FileSet object into memory
with the entire (sub) tree of a file system if you call Fill() with the
deep selector; or just fill the set with the contents of a single
directory when the shallow selector is used.
etc.
So anyway. I've been in this modeling space for a while. I looked at
other means of modeling such as SQL and I decided this one was "the
best" for what I was doing.
The results are fast, deterministic, and type-safe with no internal
typecasts or full table scan results.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: FileEntries.h --]
[-- Type: text/x-chdr; name="FileEntries.h", Size: 6532 bytes --]
#ifndef FILEENTRIES_HPP
#define FILEENTRIES_HPP
/* Part of Underdog. https://underdog.sourceforge.net
* Released under GPLv3. Other licenses may be available.
* Robert White © Copyright 2014 <rwhite@pobox.com>
*/
#include <dirent.h> /* Defines DT_* constants */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <regex.h>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/tag.hpp>
#include "FileWrapper.h"
namespace FileEntries {
using namespace ::boost;
using namespace ::boost::multi_index;
enum FileType {
UnknownFile,
BlockFile,
CharFile,
DirectoryFile,
PipeFile,
SymlinkFile,
SocketFile,
RegularFile
};
inline
enum FileType
DecodeFileType(const int filetype)
{
enum FileType retval = UnknownFile;
switch (filetype) {
case DT_BLK: retval = BlockFile; break;
case DT_CHR: retval = CharFile; break;
case DT_DIR: retval = DirectoryFile; break;
case DT_FIFO: retval = PipeFile; break;
case DT_LNK: retval = SymlinkFile; break;
case DT_REG: retval = RegularFile; break;
case DT_SOCK: retval = BlockFile; break;
}
return retval;
}
struct FileEntry {
struct NameIndex {};
struct DirectoryIndex {};
struct INodeIndex {};
struct TypedIndex {};
dev_t Device;
long ParentINode;
long INode;
std::string Name;
enum FileType Type;
};
typedef multi_index_container<
FileEntry,
indexed_by<
ordered_non_unique<
tag<FileEntry::NameIndex>,
member<
FileEntry,
std::string,
&FileEntry::Name
>
>,
ordered_non_unique<
tag<FileEntry::DirectoryIndex>,
composite_key<
FileEntry,
member<
FileEntry,
dev_t,
&FileEntry::Device
>,
member<
FileEntry,
long,
&FileEntry::ParentINode
>
>
>,
ordered_non_unique<
tag<FileEntry::INodeIndex>,
composite_key<
FileEntry,
member<
FileEntry,
dev_t,
&FileEntry::Device
>,
member<
FileEntry,
long,
&FileEntry::INode
>
>
>,
ordered_non_unique<
tag<FileEntry::TypedIndex>,
composite_key<
FileEntry,
member<
FileEntry,
enum FileType,
&FileEntry::Type
>,
member<
FileEntry,
std::string,
&FileEntry::Name
>
>
>
>
> FileSet;
struct regex {
regex_t built;
regex(const std::string pattern):
built({})
{
if (regcomp(&built,pattern.c_str(),REG_EXTENDED|REG_NOSUB) != 0) {
throw std::exception();
}
}
~regex()
{
regfree(&built);
}
bool operator()(const std::string & value) const
{
return 0 == regexec(&built,value.c_str(),0,0,0);
}
};
struct selector
{
typedef ::std::pair<bool,bool> result_type;
};
struct shallow_selector:
public selector
{
regex include_pattern;
shallow_selector(const std::string pattern
= ::std::string("^(([^.])|([.][^.])|([.]{2}.))")):
include_pattern(pattern)
{
return;
}
result_type operator()(const FileEntry & ent) const
{
return result_type(include_pattern(ent.Name),false);
}
};
struct deep_selector:
public selector
{
regex include_pattern;
regex recurse_pattern;
deep_selector(
const std::string i_pattern = "^(([^.])|([.][^.])|([.]{2}.))",
const std::string r_pattern = "^(([^.])|([.][^.])|([.]{2}.))"
):
include_pattern(i_pattern),
recurse_pattern(r_pattern)
{
return;
}
result_type
operator()(const FileEntry & ent) const
{
bool candidate = (ent.Type == DirectoryFile) || (ent.Type == SymlinkFile);
return result_type(include_pattern(ent.Name),candidate && recurse_pattern(ent.Name));
}
};
template <
class U,
bool First = true,
bool Second = false
>
struct inverted:
public selector
{
const U & Real;
inverted(const U & real):
Real(real)
{
return;
}
result_type
operator()(const FileEntry & ent) const
{
result_type retval = Real(ent);
if (First) {
retval.first = !retval.first;
}
if (Second) {
retval.second = !retval.second;
}
return retval;
}
};
struct linux_dirent {
long d_inode;
off_t d_offset;
unsigned short d_reclen;
char d_name[1];
};
union cursor_type {
char * ch;
struct linux_dirent * ld;
};
template <class U>
void
Fill(int directory_fd,
FileSet & result,
const U & selector
)
{
FileEntry temp = {};
int count;
char buffer[1024];
char* cursor;
char* extent;
struct stat directory_stat = {};
fstat(directory_fd,&directory_stat);
FileSet::index<FileEntry::DirectoryIndex>::type & result_view =
result.get<FileEntry::DirectoryIndex>();
temp.Device = directory_stat.st_dev;
temp.ParentINode = directory_stat.st_ino;
result_view.erase(
result_view.lower_bound(boost::make_tuple(temp.Device,temp.ParentINode)),
result_view.upper_bound(boost::make_tuple(temp.Device,temp.ParentINode))
);
lseek(directory_fd,0,SEEK_SET);
while ((count = syscall(SYS_getdents,directory_fd,buffer,sizeof(buffer))) > 0) {
int next = 0;
extent = buffer + count;
for (cursor = buffer; cursor < extent; cursor += next) {
linux_dirent * entry = reinterpret_cast<linux_dirent *>(cursor);
next = entry->d_reclen;
temp.INode = entry->d_inode;
temp.Name = entry->d_name;
temp.Type = DecodeFileType(static_cast<int>(*(cursor + next - 1)));
::std::pair<bool,bool> selection = selector(temp);
if (selection.first) {
result_view.insert(temp);
}
if (selection.second) try {
FileWrapper subdir(temp.Name,O_DIRECTORY|O_RDONLY,directory_fd);
Fill(subdir.fd(),result,selector);
} catch (...) {
// No recursion on error, silently, it's just easier.
}
}
}
return;
}
template <>
void
Fill<const std::string &>(int directory_fd,
FileSet & result,
const std::string & selector
)
{
return Fill(directory_fd,result,shallow_selector(selector));
}
inline
void
Fill(int directory_fd,
FileSet & result
)
{
return Fill(directory_fd,result,shallow_selector());
}
}
#endif
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: FileWrapper.h --]
[-- Type: text/x-chdr; name="FileWrapper.h", Size: 1501 bytes --]
#ifndef FILEWRAPPER_HPP
#define FILEWRAPPER_HPP
/* Part of Underdog. https://underdog.sourceforge.net
* Released under GPLv3. Other licenses may be available.
* Robert White © Copyright 2014 <rwhite@pobox.com>
*/
#include <string>
#include <exception>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
struct FileWrapper {
struct BadFile: public std::exception { };
FileWrapper(std::string name, int flags, int base = AT_FDCWD):
FileDescriptor(openat(base,name.c_str(),flags))
{
if (FileDescriptor == -1) {
throw BadFile();
}
}
explicit
FileWrapper(const FileWrapper & X):
FileDescriptor(dup(X.FileDescriptor))
{
if (FileDescriptor == -1) {
throw BadFile();
}
}
explicit
FileWrapper(int other_fd):
FileDescriptor(dup(other_fd))
{
if (FileDescriptor == -1) {
throw BadFile();
}
}
FileWrapper(int other_fd, int new_fd):
FileDescriptor(dup2(other_fd,new_fd))
{
if (FileDescriptor == -1) {
throw BadFile();
}
}
void
Replace(const FileWrapper & X)
{
int newfd = dup(X.FileDescriptor);
if (newfd != -1) {
close(FileDescriptor);
FileDescriptor = newfd;
}
}
~FileWrapper()
{
close(FileDescriptor);
}
operator int() const { return FileDescriptor; }
int fd() const { return FileDescriptor; }
protected:
int FileDescriptor;
private:
FileWrapper & operator =(const FileWrapper & X);
};
#endif
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-03 19:18 ` Robert White
@ 2014-12-04 6:56 ` Qu Wenruo
2014-12-10 21:57 ` Zygo Blaxell
0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-04 6:56 UTC (permalink / raw)
To: Robert White, linux-btrfs; +Cc: David Sterba
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White <rwhite@pobox.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs
<linux-btrfs@vger.kernel.org>
Date: 2014年12月04日 03:18
> On 12/01/2014 05:17 PM, Qu Wenruo wrote:
>> But I am also somewhat tired of bringing new structure new searching
>> functions or even bring larger change on
>> the btrfsck record infrastructure when I found that can't provide the
>> function when new recovery function is going
>> to be implemented.
>
> DISCLAIMER: My actual thing is modeling, so coming from the idea that
> btrfsck is basically building a model of the filesystem and comparing
> the theoretically correct model to the actual filesystem as found...
> and having spent nearly zero time inside of the btrfsck code base
> itself...
>
> I've looked at what I _think_ you are trying to do with the
> lookup/correlate phase finding peer/parent/child references and I'd
> think other things would make sense far more sense than SQL.
>
> C++ Standard Template Library at a minimum, life is too short to
> reinvent data lookups in memory. But its also too short to re-factor
> SQL queries or wait for degenerate full-table scans.
C++ doesn't comes to me in my first thought, since I haven't use it for
a long long time... :-(
>
> The boost "multi-index container", if you are going to bring in an
> external library, would do you far better for dealing with future
> structural changes. It would directly index the fields of your
> (parent,child,name) relation and let you switch iteration on-the-fly
> (e.g. you can be walking parents and use the same iterator to start
> walking names etc). (there's even a cookbook item for adding an LRU
> cache as a natural index for space-constrained items.) Again you'd
> have to switch over to C++, but it implements exactly what I think you
> are looking for. the multi-index container declaration syntax gets a
> little thick but once declared the usage model is really simple. (you
> pay most-or-all the cost during "update" which is really a
> remove-then-readd but you won't be updating as often as adding so...)
Yeah, C++ seems much easier to switch, but as the preparation, it is
better to cleanup the inode_record codes from
cmds-check.c and put them into a single file.
After the cleanup, I think we can even have different backend to test with.
(Only to test, multi-backend support also seems to be overkilled)
>
> My "crazy-ist idea" would be to maybe mmap all the metadata from the
> filesystem into the process space and build a page-table-like index
> over that. The virtual memory demand paging will become your caching
> buddy. mapped_block+offest gets you directly to the whole native
> structure. You might have to juggle mappings (map and unmap them) on
> systems with arger filesystems but smaller virtual tables (e.g.
> classic pentium 32 bit). Even then, you'd be juggling mappings and the
> kernel would be doing all the real caching. The more data you can
> leave in the mmapped ranges the less storage management takes place in
> the local code.
>
> IMHO, finding a USB stick to make a swap space on is probably just as
> good, if not better than, doing it with a whole other filesystem and
> the "put my temp files there" option during emergency maintenance.
Sadly, there is already user complaining about the memory usage:
http://comments.gmane.org/gmane.comp.file-systems.btrfs/34573
The 13T fs may contain several hundreds GB of metadata, already takes
over 8G memory.
It may takes up 10+ of GB memory, so a memory stick may help but when it
happens on a remote server?
Or can you tolerate the snail like slow IO speed?
So better store them on disk when things go really huge.
Map and unmap will be good,but it may not resolve the problem at all.
The main memory usage in btrfsck is extent record, which
we can't free them until we read them all in and checked, so even we
mmap/unmap, it can only help with
the extent_buffer(which is already freed if not used according to refs).
Now, I miss the page cache in kernel so much...
Anyway, btrfsck should really consider about the memory usage now.
>
>
> If the mailing list engine honors attachments, you will find
> FileEntries.h from one of my projects, offered as a simplified example
> of a directory entry cache implemented in a boost multi-index container.
>
> (Note that this is fully working code from an upcoming addition to my
> soruceforge project, so it's not just theoretical implementation ideas.)
>
> struct FileEntry {
> struct NameIndex {};
> struct DirectoryIndex {};
> struct INodeIndex {};
> struct TypedIndex {};
> dev_t Device;
> long ParentINode;
> long INode;
> std::string Name;
> enum FileType Type;
> };
>
> The empty structs inside the struct act as index identifiers/names
> (Because C++ templates only disambiguate on types so
> FileEntry::NameIndex is the type-name of the index of file names (etc).
>
> Any other data could be added to FileEntry without disrupting any
> existing code for indexing or iterating.
>
> The container itself is a little grusome textually ::
>
> typedef multi_index_container<
> FileEntry,
> indexed_by<
> ordered_non_unique<
> tag<FileEntry::NameIndex>,
> member<
> FileEntry,
> std::string,
> &FileEntry::Name
> >
> >,
> ordered_non_unique<
> tag<FileEntry::DirectoryIndex>,
> composite_key<
> FileEntry,
> member<
> FileEntry,
> dev_t,
> &FileEntry::Device
> >,
> member<
> FileEntry,
> long,
> &FileEntry::ParentINode
> >
> >
> >,
> ordered_non_unique<
> tag<FileEntry::INodeIndex>,
> composite_key<
> FileEntry,
> member<
> FileEntry,
> dev_t,
> &FileEntry::Device
> >,
> member<
> FileEntry,
> long,
> &FileEntry::INode
> >
> >
> >,
> ordered_non_unique<
> tag<FileEntry::TypedIndex>,
> composite_key<
> FileEntry,
> member<
> FileEntry,
> enum FileType,
> &FileEntry::Type
> >,
> member<
> FileEntry,
> std::string,
> &FileEntry::Name
> >
> >
> >
> >
> > FileSet;
>
>
> But after that it's super easy to use, uh, if you know STL iterator
> speak anyway. 8-)
>
> Having a file set is easy and typesafe in any type-able context (such
> as a member of another structure, or a global):
> FileSet result;
>
> Picking yoru index/view has to be done by (admittedly ugly looking) type
> FileSet::index<FileEntry::NameIndex>::type
>
> But then when you make one and plumb it ("result" being the working set
> FileSet::index<FileEntry::NameIndex>::type & by_name =
> result.get<FileEntry::NameIndex>();
>
> by_name is now a fully functional view of the set as an iteraterable
> container with elements like by_name.erase(some_name) and such.
>
> It gets you what you say you want from SQL but without all the
> indirection or duplication of SQL.
>
> (Note that the attached example will fill a FileSet object into memory
> with the entire (sub) tree of a file system if you call Fill() with
> the deep selector; or just fill the set with the contents of a single
> directory when the shallow selector is used.
>
> etc.
>
> So anyway. I've been in this modeling space for a while. I looked at
> other means of modeling such as SQL and I decided this one was "the
> best" for what I was doing.
>
> The results are fast, deterministic, and type-safe with no internal
> typecasts or full table scan results.
Great implement! Almost fits all the needs, except the ability to save
it on disk...
In fact, I have already an idea to implement it in pure C, although not
so generic than C++ template, but should
work anyway.
So the remaining problem will be the ability to cache things on disk for
case like the 13T array.
For C/C++, saving recording to disk will be a simplified version of page
cache, but will still be a huge chanllenge.
For SQL, it will be much easier but I am consider it as the last try
now, since the C++ implement seems quite good.
Anyway, the memory usage problem is still a problem to resolve.
Thanks,
Qu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-04 6:56 ` Qu Wenruo
@ 2014-12-10 21:57 ` Zygo Blaxell
2014-12-11 2:05 ` Qu Wenruo
0 siblings, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2014-12-10 21:57 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Robert White, linux-btrfs, David Sterba
[-- Attachment #1: Type: text/plain, Size: 990 bytes --]
On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
> The main memory usage in btrfsck is extent record, which
> we can't free them until we read them all in and checked, so even we
> mmap/unmap, it can only help with
> the extent_buffer(which is already freed if not used according to refs).
I'm thinking aloud here, but is it *really* necessary to read everything
into memory? Maybe a multiple-pass algorithm might be possible, e.g. one
to find free space by eliminating any areas that are occupied by extents,
then other passes to rebuild the metadata in the free space. Or, one
pass to verify the connectivity of references and collect dangling refs,
then a second pass which fixes only the dangling refs.
Usually sequential reads are significantly faster than swapping--even
if swapping on solid-state media. It could be that reading 260GB of
metadata sequentially two or three times is still faster than thrashing
through random lookups in 20GB of swap on a 4GB machine.
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-10 21:57 ` Zygo Blaxell
@ 2014-12-11 2:05 ` Qu Wenruo
2014-12-11 2:27 ` Zygo Blaxell
0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2014-12-11 2:05 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Robert White, linux-btrfs, David Sterba
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Zygo Blaxell <zblaxell@furryterror.org>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Date: 2014年12月11日 05:57
> On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
>> The main memory usage in btrfsck is extent record, which
>> we can't free them until we read them all in and checked, so even we
>> mmap/unmap, it can only help with
>> the extent_buffer(which is already freed if not used according to refs).
> I'm thinking aloud here, but is it *really* necessary to read everything
> into memory?
Totally agreed to only read what we need.
But some backref and counts on refs can only be determined after a full
scan, especially for leaf/node corruption
case.
> Maybe a multiple-pass algorithm might be possible, e.g. one
> to find free space by eliminating any areas that are occupied by extents,
> then other passes to rebuild the metadata in the free space. Or, one
> pass to verify the connectivity of references and collect dangling refs,
> then a second pass which fixes only the dangling refs.
I have similar idea, but not multi-pass method, instead, using per
sector scan + tree search for other data.
E.g in extent tree check, each time only record all extents in a block
group, and check them.
After check, remove the good extents/block groups and then move to next
block group.
For fs tree, any key with same objectid(ino) as a group, and only read
the group in one time and remove
the already known healthy record. (info not fully gathered or bad record
will still stay in memory)
But I don't consider this method can really save much memory though...
>
> Usually sequential reads are significantly faster than swapping--even
> if swapping on solid-state media. It could be that reading 260GB of
> metadata sequentially two or three times is still faster than thrashing
> through random lookups in 20GB of swap on a 4GB machine.
>
Definitely, but if we want to reduce memory usage, it is almost
unavoidable to do more disk IO, especially random
disk IO, so it will become a tradeoff, which may cause the already slow
fsck more slow....
Thanks,
Qu
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-11 2:05 ` Qu Wenruo
@ 2014-12-11 2:27 ` Zygo Blaxell
0 siblings, 0 replies; 18+ messages in thread
From: Zygo Blaxell @ 2014-12-11 2:27 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Robert White, linux-btrfs, David Sterba
[-- Attachment #1: Type: text/plain, Size: 1489 bytes --]
On Thu, Dec 11, 2014 at 10:05:20AM +0800, Qu Wenruo wrote:
>
> -------- Original Message --------
> Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
> From: Zygo Blaxell <zblaxell@furryterror.org>
> To: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Date: 2014年12月11日 05:57
> >On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
> >>The main memory usage in btrfsck is extent record, which
> >>we can't free them until we read them all in and checked, so even we
> >>mmap/unmap, it can only help with
> >>the extent_buffer(which is already freed if not used according to refs).
> >I'm thinking aloud here, but is it *really* necessary to read everything
> >into memory?
> Totally agreed to only read what we need.
> But some backref and counts on refs can only be determined after a
> full scan, especially for leaf/node corruption
> case.
It might be faster (and smaller) to pipe them out to sort (with gzip/lzma
compression on temporary files) than to try to insert them in a tree.
I have used that technique in some of my deduplicating programs. It can
cut the working set size by several orders of magnitude (trading it for
an O(n log n) sort, which will mostly read and write sequentially).
e.g. duplicate refs will all sort together, so when you are sequentially
reading the sorted data and the current key value changes, you know you've
seen everything that could be a duplicate, and can discard everything
in RAM.
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-02 0:37 ` Qu Wenruo
@ 2014-12-11 19:00 ` Martin Steigerwald
0 siblings, 0 replies; 18+ messages in thread
From: Martin Steigerwald @ 2014-12-11 19:00 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Austin S Hemmelgarn, linux-btrfs
Hi Qu,
Am Dienstag, 2. Dezember 2014, 08:37:44 schrieb Qu Wenruo:
> -------- Original Message --------
> Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
> From: Austin S Hemmelgarn <ahferroin7@gmail.com>
> To: Qu Wenruo <quwenruo@cn.fujitsu.com>, linux-btrfs
> <linux-btrfs@vger.kernel.org>
> Date: 2014年12月01日 20:53
>
> > On 2014-11-30 20:58, Qu Wenruo wrote:
> >> [snipped]
> >
> > So, I think this does a good job of highlighting one of the bigger
> > issues with btrfsck when it is compared to ext* and/or xfs. Despite
> > this being a problem, I really don't think using a rdbms is the way to
> > fix it, both for reasons outlined in other responses, and because fsck
> > should be as fast as possible when nothing is wrong with the fs.
>
> Although I am not stick to the crazy idea, I think it is still needed to
> point out that,
> even btrfsck is ran on a clean btrfs, it still needs to iterate all the
> extents, metadata.
> So it may not be as fast as you thought even with the current implement.
>
> Anyway thanks for the feedback.
As to my knowledge at some time xfs_repair struggled with memory issues as
well. And the developers of it made an attempt at reducing memory usage. I
think with e2fsck this has been an issue as well. So maybe might be an idea at
the approaches the devs of these two filesystem checking tools used to reduce
memory usage.
On the other hand it may be that XFS and Ext4 is just too different to BTRFS to
adapt those approaches. But in the end they are all filesystems, so there may
be enough similarities to port over some ideas.
Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
2014-12-01 1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
` (2 preceding siblings ...)
2014-12-01 12:53 ` Austin S Hemmelgarn
@ 2014-12-11 19:38 ` Roger Binns
3 siblings, 0 replies; 18+ messages in thread
From: Roger Binns @ 2014-12-11 19:38 UTC (permalink / raw)
To: linux-btrfs
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 11/30/2014 05:58 PM, Qu Wenruo wrote:
> 2. Heavy dependency If use it, btrfs-progs will include RDBMS as
> the make and runtime dependency. Such low level progs depend on
> high level programs like sqlite3 may be very strange.
BTW SQLite is designed as a library. It is shipped as a single file
with the deliberate intention you add the sqlite3.c file to your
project. For private internal tool use you don't need to depend or
use the system SQLite in any way.
https://www.sqlite.org/amalgamation.html
SQLite also lets you easily define collation sequences, your own
functions and virtual tables which will make all this easier. Using
pragmas you can control use of memory and disk space for operation.
Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iEYEARECAAYFAlSJ8rAACgkQmOOfHg372QRzJACgoip0vhoM0XEkVIB9/ZggPXX1
PuMAn3/0lP+SQyxDh6UFStt5hlA2Wwkz
=b5GF
-----END PGP SIGNATURE-----
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2014-12-11 19:38 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-01 1:58 Crazy idea of cleanup the inode_record btrfsck things with SQL? Qu Wenruo
2014-12-01 3:08 ` Duncan
2014-12-01 3:24 ` Qu Wenruo
2014-12-01 5:47 ` Duncan
2014-12-01 6:25 ` Qu Wenruo
2014-12-01 4:03 ` Robert White
2014-12-01 6:18 ` Qu Wenruo
2014-12-01 18:10 ` Robert White
2014-12-02 1:17 ` Qu Wenruo
2014-12-03 19:18 ` Robert White
2014-12-04 6:56 ` Qu Wenruo
2014-12-10 21:57 ` Zygo Blaxell
2014-12-11 2:05 ` Qu Wenruo
2014-12-11 2:27 ` Zygo Blaxell
2014-12-01 12:53 ` Austin S Hemmelgarn
2014-12-02 0:37 ` Qu Wenruo
2014-12-11 19:00 ` Martin Steigerwald
2014-12-11 19:38 ` Roger Binns
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox