* Git for structured data
From: Cedric Sodhi @ 2025-12-05 16:51 UTC
To: git

Hello (from off list),

a filesystem of Git's working directory type can be seen as a type of database. Compared to other types of databases (relational or not), it might even be considered a fairly complex database with arbitrary nesting depth and relational semantics through symbolic links.

Git excels at version control of this specific type of database, the filesystem. Yet, Git can't be used as-is to version control any other type of database; even though they might be simpler, semantically.

We can have structured data (databases with schemas). We can have version controlled data (files with Git).

Why can't we have structured, version controlled data?

In recent years I've repeatedly struck cases where exactly that was needed. For amounts of data which are comparable to what you typically version with git; only structured. Without workarounds, either structure (table schemas) or versioning (Git) had to be sacrificed. Which is disappointing, in my opinion, seen how this only hinges on the type of source Git would have to read the data from.

I'd like to ask your opinion, on what you think is the most promising approach to unify structure and version control with Git. Currently, I can think of two, kind of complementary options:

A) Map structured data into a filesystem, possibly through FUSE, then version control that with Git.
   Pros: Can mix non-structured data and structured data.
   Cons: Expect terrible performance

B) Abstract Git's data backend to allow Git to read directly from databases
   Pros: Perhaps reasonable performance
   Cons: Additional changes to Git would be needed to allow mixing data.

What would you recommend?

Kind regards,
Cedric

* Re: Git for structured data
From: René Scharfe @ 2025-12-06 16:27 UTC
To: Cedric Sodhi, git

On 12/5/25 5:51 PM, Cedric Sodhi wrote:
> Why can't we have structured, version controlled data?
>
> In recent years I've repeatedly struck cases where exactly that was needed. For amounts of data which are comparable to what you typically version with git; only structured. Without workarounds, either structure (table schemas) or versioning (Git) had to be sacrificed. Which is disappointing, in my opinion, seen how this only hinges on the type of source Git would have to read the data from.
>
> I'd like to ask your opinion, on what you think is the most promising approach to unify structure and version control with Git. Currently, I can think of two, kind of complementary options:
>
> A) Map structured data into a filesystem, possibly through FUSE, then version control that with Git. Pros: Can mix non-structured data and structured data. Cons: Expect terrible performance
>
> B) Abstract Git's data backend to allow Git to read directly from databases Pros: Perhaps reasonable performance Cons: Additional changes to Git would be needed to allow mixing data.
>
> What would you recommend?

Did you consider Data Version Control (https://dvc.org/) or Dolt (https://github.com/dolthub/dolt)? Not a recommendation, since I haven't used them myself, but they match your description and call themselves "Git for data".

René

* Re: Git for structured data
From: Cedric Sodhi @ 2025-12-06 18:47 UTC
To: René Scharfe; +Cc: git

On Sat, Dec 06, 2025 at 05:27:11PM +0100, René Scharfe wrote:
> Did you consider Data Version Control (https://dvc.org/) or Dolt (https://github.com/dolthub/dolt)? Not a recommendation, since I haven't used them myself, but they match your description and call themselves "Git for data".
>
> René

Hello and thank you for the two suggestions. I've read up on them and came to the following understanding. But first, I would like to mention that by "data" that needs to be versioned, I was not referring to binary (opaque) data, but rather exactly the type of data which Git currently manages ("source code", in a sense); but in a structured form. Think text or source code fragments in an SQL database.

DVC, although different, seems to be similar to Git LFS in its focus on managing large, opaque data (binary blobs) as opposed to small, transparent data (text files). Essentially, it is meant to overcome Git's lack of performance with large files. I therefore think that it does not match my goal.

Dolt appears to fit the functional description. But while it exposes a Git-like CLI, it seems to be neither based on Git, nor derived from it. Also, its software architecture is largely monolithic as it bundles its own SQL server, which makes it two-fold dependent on foreign code (Git for the interface, SQL for the database).

Cedric

* Re: Git for structured data
From: Christian Couder @ 2025-12-06 21:02 UTC
To: Cedric Sodhi; +Cc: René Scharfe, git

On Sat, Dec 6, 2025 at 7:48 PM Cedric Sodhi <manday@openmail.cc> wrote:
> Hello and thank you for the two suggestions. I've read up on them and came to the following understanding. But first, I would like to mention that by "data" that needs to be versioned, I was not referring to binary (opaque) data, but rather exactly the type of data which Git currently manages ("source code", in a sense); but in a structured form. Think text or sourcecode fragments in an SQL database.

Not sure it's what you are looking for but https://fossil-scm.org stores its content in an SQLite database.

* Re: Git for structured data
From: Simon Richter @ 2025-12-07 5:26 UTC
To: Cedric Sodhi, git

Hi,

On 12/6/25 01:51, Cedric Sodhi wrote:
> Why can't we have structured, version controlled data?

You can version control inside a relational database, by adding valid time columns with a range-between-timestamps type and a constraint to disallow overlaps. There are good indexing techniques; the first thing that springs to mind is [1], but I'm fairly sure there are others, and a modern RDBMS should provide constraints on range types.

A valid time column can encode either "time at which the data is valid", or "time at which the data was current in the database"; with two columns, you can encode both at the same time. If you hide the "data is current within" column behind a view and automatically update it, this creates the historical log of when an entry was updated.

Tracking arbitrary data in git is, of course, also possible, but requires diff/merge tools adequate for the data. The built-in tools are adequate for the main use case, text files that usually change on a line-by-line basis and are seldom reorganized as a whole, so we can pretend they are one-dimensional.

In KiCad, the files we generate describe a three-dimensional structure. No matter how we normalize the file contents, elements can only be moved on one axis without requiring us to move them to a different position in the file.

So if I sort by z,y,x, then moving an object to a different z coordinate likely results in "deletion" of the old object at the existing place, and "creation" of a new object at a different place in the file; the one-dimensional diff algorithm is unable to create a minimal diff here that shows that only the z coordinate changed.

Not sorting (i.e. leaving elements in creation order) means that deleting and recreating an object with the same parameters causes it to move within the file.

The solution is to treat the serialized representation as just that, a serialization, and not try to interpret order in any meaningful way, but this requires dedicated diff/patch tools and heuristics that guess whether deleting and creating similar objects constitutes a move or if the objects are unrelated, same as git does in its move detection.

I think that diff/merge on relational data is more difficult than expressing history inside the relational tables. For other data structures, this may be different, and git might be a viable storage method for history -- but in any case it requires the effort to build an appropriate plug-in.

Simon

[1] https://link.springer.com/chapter/10.1007/BFb0054512
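
A minimal sketch of the history-inside-the-table idea from the message above, using Python's built-in sqlite3 module. It is an illustration under assumptions, not the setup described in the message: SQLite has no range type, so two integer columns (valid_from, valid_to) stand in for one, and a partial unique index stands in for a no-overlap constraint by allowing only one open version per id. The table, view and function names are hypothetical, and timestamps are passed in explicitly as a logical clock.

```python
import sqlite3

SCHEMA = """
CREATE TABLE item_history (
    id         INTEGER NOT NULL,
    body       TEXT    NOT NULL,
    valid_from INTEGER NOT NULL,
    valid_to   INTEGER            -- NULL means "still current"
);
-- Stand-in for a no-overlap constraint: at most one open row per id.
CREATE UNIQUE INDEX one_current_row
    ON item_history(id) WHERE valid_to IS NULL;

-- The view hides the history columns and shows only current data.
CREATE VIEW item AS
    SELECT id, body FROM item_history WHERE valid_to IS NULL;
"""


def open_db():
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def set_item(conn, item_id, body, now):
    """Record a new version of an item at logical time `now`."""
    with conn:  # one transaction: close the old version, open the new one
        conn.execute(
            "UPDATE item_history SET valid_to = ? "
            "WHERE id = ? AND valid_to IS NULL",
            (now, item_id),
        )
        conn.execute(
            "INSERT INTO item_history (id, body, valid_from) VALUES (?, ?, ?)",
            (item_id, body, now),
        )


def item_as_of(conn, item_id, when):
    """Return the body that was current at logical time `when`, or None."""
    row = conn.execute(
        "SELECT body FROM item_history "
        "WHERE id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (item_id, when, when),
    ).fetchone()
    return row[0] if row else None
```

Calling set_item twice for the same id leaves two rows in item_history, while the item view shows only the latest one; item_as_of reads back any earlier state. In a database with real range types, the two columns and the partial index would presumably be replaced by a single range column and an overlap constraint.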

* Re: Git for structured data
From: Cedric Sodhi @ 2025-12-07 17:23 UTC
To: Simon Richter; +Cc: git

Hi Simon,

If your suggestion at this point is that I consider implementing a VCS in the database instead of basing it on Git, I'm sceptical. I'd end up re-implementing Git's features.

I agree with many things you say. There is no magic recipe to apply Git to relational databases; specific tools -- at the very least one per database type, but possibly tailored further to the specific data it holds -- would have to be written.

However, I do think Git generalizes to RDBMS more readily than it may seem and, in fact, one method to map a DB into a filesystem-isomorphic thing which Git knows how to handle, would fit 99% of all cases.

Your example from KiCad (which can be understood as content which is stored in an RDBMS) could be a good illustration: If you normalize the file (one element per line) by sorting by UIDs corresponding to the individual elements, then you'd see no diff unless either UID or contents change. And UIDs typically wouldn't change unless you actually delete-and-recreate something.

Every table in the DB which is normalized that way could be mapped as a single file. Of course a more granular mapping table-file or even row-file would be possible. In fact, mapping one row (or element of the Schema/Layout) per file would exploit Git's ability to detect "moves", meaning that you wouldn't even need UIDs for the elements to create good diffs. There is power in the semantics of the filesystem hierarchy which you lose when all contents become a single database/KiCad-file.

In short: In my opinion, there really doesn't seem to be any algorithmic difficulty. The only thing that stops Git from doing practicable versioning of databases is its inability to access their contents transparently in even the most trivial ways.

Best,
Cedric
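
The table-per-file normalization discussed in this message can be tried out with a small sketch. It assumes a hypothetical SQLite table named element with a uid column and x, y, z coordinates, serializes it as one JSON object per line, and counts how many separate hunks Python's difflib needs after one element is moved along z. Everything here (table name, columns, helper names) is made up for illustration; it is not how KiCad or Git store anything.

```python
import difflib
import json
import sqlite3


def dump_table(conn, table, order_by):
    """Serialize a table as text lines: one JSON object per row.

    `table` and `order_by` are trusted identifiers in this sketch.
    """
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY {order_by}")
    cols = [d[0] for d in cur.description]
    return [json.dumps(dict(zip(cols, row)), sort_keys=True) + "\n"
            for row in cur]


def hunks(old, new):
    """Count the separate hunks a line-based unified diff needs."""
    return sum(1 for line in difflib.unified_diff(old, new, n=0)
               if line.startswith("@@"))


def demo(order_by):
    """Move one element along z and diff the dumps taken before and after."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE element (uid INTEGER PRIMARY KEY, x, y, z)")
    conn.executemany(
        "INSERT INTO element VALUES (?, ?, ?, ?)",
        [(1, 0, 0, 0), (2, 5, 5, 1), (3, 9, 9, 2)],
    )
    before = dump_table(conn, "element", order_by)
    conn.execute("UPDATE element SET z = 7 WHERE uid = 1")  # the "move"
    after = dump_table(conn, "element", order_by)
    return hunks(before, after)
```

With demo("uid") the changed row stays where it is, so the diff is a single in-place hunk; with demo("z, y, x") the same change shows up as a deletion in one place and an addition in another. The row-per-file mapping is not shown here.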