Git for structured data


* Git for structured data
@ 2025-12-05 16:51 Cedric Sodhi
  2025-12-06 16:27 ` René Scharfe
  2025-12-07  5:26 ` Simon Richter
  0 siblings, 2 replies; 6+ messages in thread
From: Cedric Sodhi @ 2025-12-05 16:51 UTC (permalink / raw)
  To: git

Hello (from off list),

a filesystem of Git's working directory type can be seen as a type of database. Compared to other types of databases (relational or not), it might even be considered a fairly complex database with arbitrary nesting depth and relational semantics through symbolic links.

Git excels at version control of this specific type of database, the filesystem. Yet Git can't be used as-is to version control any other type of database, even though those might be semantically simpler.

We can have structured data (databases with schemas). We can have version controlled data (files with Git).

Why can't we have structured, version controlled data?

In recent years I've repeatedly run into cases where exactly that was needed, for amounts of data comparable to what you typically version with Git, only structured. Without workarounds, either structure (table schemas) or versioning (Git) had to be sacrificed. That is disappointing, in my opinion, seeing as it hinges only on the type of source Git would have to read the data from.

I'd like to ask what you think is the most promising approach to unifying structure and version control with Git. Currently, I can think of two somewhat complementary options:

A) Map structured data into a filesystem, possibly through FUSE, then version control that with Git.
Pros: Can mix non-structured data and structured data.
Cons: Expect terrible performance

B) Abstract Git's data backend to allow Git to read directly from databases
Pros: Perhaps reasonable performance
Cons: Additional changes to Git would be needed to allow mixing data.
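
Option A can be approximated without FUSE by a one-off export. The following is a minimal sketch, not a proposal for Git itself: it assumes an SQLite database, a hypothetical table `snippets` with a primary key `id`, and an invented "column: value" file layout. Each row becomes one file, so the resulting directory could be committed like any other working tree. Values containing newlines would need escaping, which is omitted here.

```python
import sqlite3
import tempfile
from pathlib import Path


def export_rows(conn, table, key, root):
    """Write each row of `table` to root/<table>/<key value>.txt.

    The file holds one "column: value" line per column (an assumed layout).
    """
    cur = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"')
    columns = [d[0] for d in cur.description]
    out_dir = Path(root) / table
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in cur:
        record = dict(zip(columns, row))
        path = out_dir / f"{record[key]}.txt"
        lines = [f"{col}: {record[col]}" for col in columns]
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(path)
    return written


# Hypothetical example data: source code fragments stored in a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snippets (id INTEGER PRIMARY KEY, lang TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO snippets VALUES (?, ?, ?)",
    [(1, "c", "int main(void) { return 0; }"), (2, "sh", "echo hello")],
)
export_dir = tempfile.mkdtemp()
exported = export_rows(conn, "snippets", "id", export_dir)
```

Running `git init` and `git add` inside `export_dir` would then version the rows as ordinary files. The reverse direction (importing files back into the table) is the part that a real tool would still have to provide.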

What would you recommend?

Kind regards,
Cedric


* Re: Git for structured data
  2025-12-05 16:51 Git for structured data Cedric Sodhi
@ 2025-12-06 16:27 ` René Scharfe
  2025-12-06 18:47   ` Cedric Sodhi
  2025-12-07  5:26 ` Simon Richter
  1 sibling, 1 reply; 6+ messages in thread
From: René Scharfe @ 2025-12-06 16:27 UTC (permalink / raw)
  To: Cedric Sodhi, git

On 12/5/25 5:51 PM, Cedric Sodhi wrote:
> 
> Why can't we have structured, version controlled data?
> 
> In recent years I've repeatedly run into cases where exactly that was
> needed, for amounts of data comparable to what you typically version
> with Git, only structured. Without workarounds, either structure
> (table schemas) or versioning (Git) had to be sacrificed. That is
> disappointing, in my opinion, seeing as it hinges only on the type of
> source Git would have to read the data from.
> 
> I'd like to ask what you think is the most promising approach to
> unifying structure and version control with Git. Currently, I can
> think of two somewhat complementary options:
> 
> A) Map structured data into a filesystem, possibly through FUSE,
> then version control that with Git.
> Pros: Can mix non-structured data and structured data.
> Cons: Expect terrible performance
> 
> B) Abstract Git's data backend to allow Git to read directly from
> databases
> Pros: Perhaps reasonable performance
> Cons: Additional changes to Git would be needed to allow mixing data.
> 
> What would you recommend?
Did you consider Data Version Control (https://dvc.org/) or Dolt
(https://github.com/dolthub/dolt)?  Not a recommendation, since I
haven't used them myself, but they match your description and call
themselves "Git for data".

René



* Re: Git for structured data
  2025-12-06 16:27 ` René Scharfe
@ 2025-12-06 18:47   ` Cedric Sodhi
  2025-12-06 21:02     ` Christian Couder
  0 siblings, 1 reply; 6+ messages in thread
From: Cedric Sodhi @ 2025-12-06 18:47 UTC (permalink / raw)
  To: René Scharfe; +Cc: git

On Sat, Dec 06, 2025 at 05:27:11PM +0100, René Scharfe wrote:
> Did you consider Data Version Control (https://dvc.org/) or Dolt
> (https://github.com/dolthub/dolt)?  Not a recommendation, since I
> haven't used them myself, but they match your description and call
> themselves "Git for data".
> 
> René

Hello and thank you for the two suggestions. I've read up on them and came to the following understanding. But first, I would like to mention that by "data" that needs to be versioned, I was not referring to binary (opaque) data, but rather exactly the type of data which Git currently manages ("source code", in a sense), only in a structured form. Think text or source code fragments in an SQL database.

DVC, although different, seems to be similar to Git LFS in its focus on managing large, opaque data (binary blobs) as opposed to small, transparent data (text files). Essentially, it is meant to overcome Git's lack of performance with large files. I therefore think that it does not match my goal.

Dolt appears to fit the functional description. But while it exposes a Git-like CLI, it seems to be neither based on Git nor derived from it. Also, its software architecture is largely monolithic, as it bundles its own SQL server, which makes it doubly dependent on foreign code (Git for the interface, SQL for the database).

Cedric


* Re: Git for structured data
  2025-12-06 18:47   ` Cedric Sodhi
@ 2025-12-06 21:02     ` Christian Couder
  0 siblings, 0 replies; 6+ messages in thread
From: Christian Couder @ 2025-12-06 21:02 UTC (permalink / raw)
  To: Cedric Sodhi; +Cc: René Scharfe, git

On Sat, Dec 6, 2025 at 7:48 PM Cedric Sodhi <manday@openmail.cc> wrote:

> Hello and thank you for the two suggestions. I've read up on them and came to the following understanding. But first, I would like to mention that by "data" that needs to be versioned, I was not referring to binary (opaque) data, but rather exactly the type of data which Git currently manages ("source code", in a sense), only in a structured form. Think text or source code fragments in an SQL database.

Not sure it's what you are looking for but https://fossil-scm.org
stores its content in an SQLite database.


* Re: Git for structured data
  2025-12-05 16:51 Git for structured data Cedric Sodhi
  2025-12-06 16:27 ` René Scharfe
@ 2025-12-07  5:26 ` Simon Richter
  2025-12-07 17:23   ` Cedric Sodhi
  1 sibling, 1 reply; 6+ messages in thread
From: Simon Richter @ 2025-12-07  5:26 UTC (permalink / raw)
  To: Cedric Sodhi, git


Hi,

On 12/6/25 01:51, Cedric Sodhi wrote:

> Why can't we have structured, version controlled data?

You can version control inside a relational database, by adding valid 
time columns with a range-between-timestamps type and a constraint to 
disallow overlaps. There are good indexing techniques, the first thing 
that springs to mind is [1], but I'm fairly sure there are others, and a 
modern RDBMS should provide constraints on range types.

A valid time column can encode either "time at which the data is valid" 
or "time at which the data was current in the database". With two 
columns, you can encode both at the same time.

If you hide the "data is current within" column behind a view and 
automatically update it, this creates the historical log of when an 
entry was updated.
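
A minimal sketch of the "data is current within" column hidden behind a view. It uses SQLite, which has no range types or exclusion constraints, so two integer timestamp columns and a partial unique index stand in for them; the table, view and function names are illustrative assumptions, and the automatic update is done by a Python helper rather than by triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_history (
    id         INTEGER NOT NULL,
    body       TEXT    NOT NULL,
    valid_from INTEGER NOT NULL,   -- inclusive
    valid_to   INTEGER,            -- exclusive; NULL means "still current"
    CHECK (valid_to IS NULL OR valid_to > valid_from)
);
-- At most one open (current) version per id.
CREATE UNIQUE INDEX one_current_per_id
    ON item_history (id) WHERE valid_to IS NULL;
-- The view hides the time columns from ordinary readers.
CREATE VIEW item AS
    SELECT id, body FROM item_history WHERE valid_to IS NULL;
""")


def put(conn, item_id, body, now):
    """Close the current version of item_id (if any) at `now`, then open a new one."""
    with conn:  # both statements run in one transaction
        conn.execute(
            "UPDATE item_history SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
            (now, item_id),
        )
        conn.execute(
            "INSERT INTO item_history (id, body, valid_from) VALUES (?, ?, ?)",
            (item_id, body, now),
        )


def as_of(conn, item_id, when):
    """Return the body that was current at time `when`, or None."""
    row = conn.execute(
        "SELECT body FROM item_history WHERE id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (item_id, when, when),
    ).fetchone()
    return row[0] if row else None


put(conn, 1, "first draft", now=100)
put(conn, 1, "second draft", now=200)
```

Here `as_of(conn, 1, 150)` looks up the version that was current at time 150, while `SELECT body FROM item` only ever sees the latest one. The partial index prevents two open versions of the same id, but it does not rule out every overlap the way a true exclusion constraint on a range type would.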

Tracking arbitrary data in git is, of course, also possible, but 
requires diff/merge tools adequate for the data. The built-in tools are 
adequate for the main use case, text files that usually change on a 
line-by-line basis and are seldom reorganized as a whole, so we can 
pretend they are one-dimensional.

In KiCad, the files we generate describe a three-dimensional structure. 
No matter how we normalize the file contents, elements can only be moved 
on one axis without requiring us to move them to a different position in 
the file.

So if I sort by z,y,x, then moving an object to a different z coordinate 
likely results in "deletion" of the old object at the existing place 
and "creation" of a new object at a different place in the file. The 
one-dimensional diff algorithm is unable to create a minimal diff here 
that shows that only the z coordinate changed.
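
The effect can be reproduced with a small sketch. The object list and the one-line-per-object format are made up for illustration and are not KiCad's file format; Python's `difflib` stands in for a line-based diff.

```python
import difflib


def serialize(objects):
    """One line per object, sorted by (z, y, x)."""
    ordered = sorted(objects, key=lambda o: (o["z"], o["y"], o["x"]))
    return [f'{o["name"]} x={o["x"]} y={o["y"]} z={o["z"]}' for o in ordered]


before = [
    {"name": "pad", "x": 0, "y": 0, "z": 0},
    {"name": "via", "x": 1, "y": 0, "z": 1},
    {"name": "track", "x": 2, "y": 0, "z": 2},
    {"name": "label", "x": 3, "y": 0, "z": 3},
]
after = [dict(o) for o in before]
after[0]["z"] = 5  # only the z coordinate of "pad" changes

full_diff = list(
    difflib.unified_diff(serialize(before), serialize(after), lineterm="", n=0)
)
# Hunk headers mark separate places in the file that the diff touches.
hunks = [line for line in full_diff if line.startswith("@@")]
# Keep only removed/added lines, skipping the "---"/"+++" file headers.
changed = [
    line
    for line in full_diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
```

Although a single field changed, `changed` holds one removed and one added line, and `hunks` shows that they sit in two separate places of the file, because the new z value moves "pad" to the end of the sort order.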

Not sorting (i.e. leaving elements in creation order) means that 
deleting and recreating an object with the same parameters causes it to 
move within the file.

The solution is to treat the serialized representation as just that, a 
serialization, and not try to interpret order in any meaningful way, but 
this requires dedicated diff/patch tools and heuristics that guess 
whether deleting and creating similar objects constitutes a move or if 
the objects are unrelated, same as git does in its move detection.
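
A sketch of such an order-insensitive comparison follows. The records are plain dicts with hashable values, and the pairing rule (a deleted and a created record count as the same object if they differ in at most `max_changed` fields) is an arbitrary heuristic chosen for illustration, matched greedily rather than optimally.

```python
from collections import Counter


def freeze(record):
    """Hashable, order-independent form of a dict record."""
    return tuple(sorted(record.items()))


def changed_fields(a, b):
    """Names of the fields whose values differ between two records."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


def structural_diff(old, new, max_changed=1):
    """Compare two collections of records, ignoring their order.

    Returns (modified, deleted, created), where modified is a list of
    (old_record, new_record, changed_field_names) tuples.
    """
    old_count = Counter(freeze(r) for r in old)
    new_count = Counter(freeze(r) for r in new)
    # Multiset difference: records present on one side only.
    deleted = [dict(f) for f in (old_count - new_count).elements()]
    created = [dict(f) for f in (new_count - old_count).elements()]
    modified = []
    for d in list(deleted):
        # Greedily pick the created record closest to this deleted one.
        best = min(created, key=lambda c: len(changed_fields(d, c)), default=None)
        if best is not None and len(changed_fields(d, best)) <= max_changed:
            modified.append((d, best, changed_fields(d, best)))
            deleted.remove(d)
            created.remove(best)
    return modified, deleted, created


old = [
    {"name": "pad", "x": 0, "y": 0, "z": 0},
    {"name": "via", "x": 1, "y": 0, "z": 1},
    {"name": "track", "x": 2, "y": 0, "z": 2},
]
# Same objects in a different order, with only pad's z coordinate changed.
new = [
    {"name": "track", "x": 2, "y": 0, "z": 2},
    {"name": "via", "x": 1, "y": 0, "z": 1},
    {"name": "pad", "x": 0, "y": 0, "z": 5},
]
modified, deleted, created = structural_diff(old, new)
```

Reordering alone produces no output at all, and the changed pad is reported as one modification of field `z` instead of an unrelated deletion and creation. Where to put the threshold is exactly the kind of data-specific guess that a dedicated tool would have to make.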

I think that diff/merge on relational data is more difficult than 
expressing history inside the relational tables. For other data 
structures, this may be different, and git might be a viable storage 
method for history -- but in any case it requires the effort to build an 
appropriate plug-in.

    Simon

[1] https://link.springer.com/chapter/10.1007/BFb0054512



* Re: Git for structured data
  2025-12-07  5:26 ` Simon Richter
@ 2025-12-07 17:23   ` Cedric Sodhi
  0 siblings, 0 replies; 6+ messages in thread
From: Cedric Sodhi @ 2025-12-07 17:23 UTC (permalink / raw)
  To: Simon Richter; +Cc: git

Hi Simon

If your suggestion at this point is that I consider implementing a VCS in the database instead of basing it on Git, I'm sceptical. I'd end up re-implementing Git's features.

I agree with many things you say. There is no magic recipe to apply Git to relational databases; specific tools -- at the very least one per database type, but possibly tailored further to the specific data it holds -- would have to be written.

However, I do think Git generalizes to RDBMSs more readily than it may seem. In fact, a single method of mapping a DB into a filesystem-isomorphic structure which Git knows how to handle would fit 99% of all cases. Your example from KiCad (which can be understood as content stored in an RDBMS) could be a good illustration:

If you normalize the file (one element per line) by sorting by the UIDs of the individual elements, then you'd see no diff unless either a UID or the contents change. And UIDs typically wouldn't change unless you actually delete and recreate something. Every table in the DB which is normalized that way could be mapped to a single file. Of course, a more granular mapping, down to one file per row, would also be possible. In fact, mapping one row (or element of the Schema/Layout) per file would exploit Git's ability to detect "moves", meaning that you wouldn't even need UIDs for the elements to create good diffs. There is power in the semantics of the filesystem hierarchy which you lose when all contents become a single database/KiCad file.

In short: in my opinion, there doesn't seem to be any algorithmic difficulty. The only thing that stops Git from doing practicable versioning of databases is its inability to access their contents transparently, even in the most trivial manner.
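
A minimal sketch of the table-per-file mapping, assuming an SQLite table `element` with a stable `uid` column; the table, its columns and the "column=value" line format are invented for illustration. Each row becomes one line, and the lines are sorted by UID.

```python
import difflib
import sqlite3


def dump_table(conn, table, uid):
    """Serialize a table as one line per row, sorted by the UID column."""
    cur = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{uid}"')
    columns = [d[0] for d in cur.description]
    return ["\t".join(f"{c}={v}" for c, v in zip(columns, row)) for row in cur]


def make_db(rows):
    """Build a throwaway in-memory database holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE element (uid TEXT PRIMARY KEY, x INTEGER, y INTEGER, z INTEGER)"
    )
    conn.executemany("INSERT INTO element VALUES (?, ?, ?, ?)", rows)
    return conn


rows = [("a1", 0, 0, 0), ("b2", 1, 0, 1), ("c3", 2, 0, 2)]
original = dump_table(make_db(rows), "element", "uid")
# Same rows inserted in reverse order: the dump does not change.
reordered = dump_table(make_db(list(reversed(rows))), "element", "uid")
# Element a1 moved to z=5: its line stays where it is.
moved_z = dump_table(make_db([("a1", 0, 0, 5)] + rows[1:]), "element", "uid")

hunks = [
    line
    for line in difflib.unified_diff(original, moved_z, lineterm="", n=0)
    if line.startswith("@@")
]
```

Insertion order has no influence on the dump, and changing one coordinate yields a single hunk that replaces one line in place, which is the behaviour a line-based diff handles well.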

Best,
Cedric
