(I don't have much time to go into detail on some parts at the moment.)
I think you already understood that time is the key factor, and since this is a hobby for all of us it is hard to start something very complex; it has to start with small steps (at least that is my view and what I've done). The next huge problem is how TOSEC and other collections work: you will have to pick information from dats that may change and that contain mistakes in most cases, you can't control most of the rules or change them when they are bad, and a small change somewhere may end up forcing a schema change or something else that takes time.
The most important part is planning it right, but in my opinion how deep and complex the planning gets depends on the team developing it. In this project I did some planning, but nothing huge or consistently documented, because it is just me and I would end up spending all my time planning and getting nothing done. So I started with a big idea but with some aspects not well defined; it was like one big test that kept getting updated with new parts added to it. Now I've come to a point where I see that I should just rewrite this core properly, so I can use it as a base for something instead of constantly rewriting and duplicating code in ad hoc development.
Now, a bit more about your last post. I'm not sure I understood it all (my English isn't great):
Storing release information (and so datfile history) is a great idea that I haven't finished yet, and I have no idea if I will ever have time to do so (I grabbed some older packs and have some older TNCs, TUGIDs, etc., but for now I only have small stats about these packs and not full information).
The main problems with that idea that I can remember right now range from the huge size it would take in the DB, to the millions of errors you find in old dats, to information that is impossible to parse.
I mean, the TNC changed a lot and the older sets have flags that are not recognized now; that information wouldn't be parsed until a parser for it was written (and there is no great documentation about the older rules, plus they weren't always followed).
Next, if by any means we managed to extract all the information from the flags, you would end up with a TON of invalid values, from invalid dates that are easy to check to completely invalid groups of information: nonexistent publishers that are just garbage and have since been fixed; descriptors, countries and language codes that are invalid and don't even exist; scene groups; persons whose names are not written in the expected "(Last Name, Name)" form; and so on. In short, a ton of errors. A DB that covers all the details of a set needs to hold this kind of information, and holding the details of older sets too means that every typo and error that ever got into one of these flags would need to be added, giving you, for example, a list of scene groups where more than half would need to be tagged as invalid/errors.
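To give an idea of the "easy to check" part, here is a minimal Python sketch of validating two kinds of flag values. It is only an illustration: the accepted date forms (YYYY, YYYY-MM, YYYY-MM-DD) and the exact "Last Name, Name" pattern are my assumptions, the function names are made up, and a real naming convention may allow things this does not handle (for example placeholders for unknown digits).

```python
import re
from datetime import date

# Assumed date forms: YYYY, YYYY-MM or YYYY-MM-DD (an assumption, not a spec).
DATE_RE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")

# Assumed person form: "Last Name, Name" with exactly one comma.
PERSON_RE = re.compile(r"^[^,()]+, [^,()]+$")


def is_valid_date_flag(value):
    """Return True if value looks like a well-formed, existing date."""
    match = DATE_RE.match(value)
    if match is None:
        return False
    year, month, day = match.groups()
    if month is None:
        return True  # year only
    if not 1 <= int(month) <= 12:
        return False
    if day is None:
        return True  # year and month only
    try:
        date(int(year), int(month), int(day))  # rejects e.g. February 30
    except ValueError:
        return False
    return True


def is_valid_person_flag(value):
    """Return True if value is written as 'Last Name, Name'."""
    return PERSON_RE.match(value) is not None
```

Checks like these only catch malformed values; a publisher or scene group that is well-formed but simply wrong still needs a human to spot it.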
In my view that is a little too much: too much work and too much information (half of it unneeded). So my planned approach would be to keep that level of detail only for the current datfiles; older versions should only have datfile details and stats, and eventually also the existing set names and ROMs, but without flag details for each of the older sets.
After that you have to take into consideration that datfile names change over time, to fix company or system names that were wrong. These changes need to be manually tracked: you would need to state that "COMPANY System128 - Dat (2000-00-00)" was an older version of "Company System 128 - Dat (2001-00-00)" where the datfile was renamed, and correctly mark category changes and datfiles that were merged or split, because an update is not always (as we can see over time) a direct one where only the version changed. Identifying this automatically is near impossible, or dangerous because it would cause errors.
Datfile updates like you described would be easy, but would also generate a huge DB with tons of duplicated sets. We have nearly 350,000 setfiles now, each one with AT LEAST one image (software image, not picture), one title, year, publisher, CRC, MD5/SHA1, file size, filename, extension and tons of extra info. Having all that info for just the latest release already takes tons of table entries and MBs; with dozens of releases all this info would be immense.
IMO this leads to the idea of storing datfiles and setfile names, with set details only for the latest dats (or for sets that didn't change much). To avoid such duplication, changes would need to be identified, so adding 2 versions of the Amiga Games ADF dat would not add 2x 20000 sets but 20000 sets first and then 1000 new ones + 500 renamed ones and so on (+ ROM changes + bla bla). Even that is already complex for the time most of us have available; dealing with datfile renames, dats split into more than one, set moves between dats and so on is also complicated.
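As a rough illustration of "store each set once, link it to every dat version that contains it", here is a minimal sketch using Python's sqlite3 module. The table and column names (dat_version, set_file, dat_set) and the helper functions are hypothetical, not an existing schema, and treating a set as the triple (name, sha1, size) is an assumption made to keep the example short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dat_version (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    UNIQUE (name, version)
);
CREATE TABLE set_file (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    sha1 TEXT NOT NULL,
    size INTEGER NOT NULL,
    UNIQUE (name, sha1, size)
);
CREATE TABLE dat_set (
    dat_version_id INTEGER NOT NULL REFERENCES dat_version(id),
    set_file_id INTEGER NOT NULL REFERENCES set_file(id),
    PRIMARY KEY (dat_version_id, set_file_id)
);
"""


def open_db(path=":memory:"):
    """Open a database and create the (hypothetical) schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_dat_version(conn, name, version, sets):
    """Store one dat release; sets is an iterable of (set_name, sha1, size)."""
    cur = conn.execute(
        "INSERT INTO dat_version (name, version) VALUES (?, ?)", (name, version)
    )
    dat_id = cur.lastrowid
    for set_name, sha1, size in sets:
        # An identical set seen in an earlier version is not stored again.
        conn.execute(
            "INSERT OR IGNORE INTO set_file (name, sha1, size) VALUES (?, ?, ?)",
            (set_name, sha1, size),
        )
        set_id = conn.execute(
            "SELECT id FROM set_file WHERE name = ? AND sha1 = ? AND size = ?",
            (set_name, sha1, size),
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO dat_set (dat_version_id, set_file_id) VALUES (?, ?)",
            (dat_id, set_id),
        )
    conn.commit()
    return dat_id
```

With this layout a second release only adds rows to set_file for sets that are new or changed, plus one small link row per set. Note that a renamed set still counts as a new row here, since the name is part of the uniqueness key; recognising renames is a separate step.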
It is not that hard to figure out setfile renames or moves for single-ROM sets just by comparing hashes + size, but with multi-ROM sets things get harder and harder because of files shared between sets (crack intros, readmes, and so on) and redumps of some files, which make it impossible to always know whether sets are related or not.
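For the single-ROM case, the hash + size comparison could look roughly like the Python sketch below. The function name and the input shape (a dict from set name to a (hash, size) pair) are assumptions made for the example; it deliberately refuses to guess when more than one old set has the same hash and size, and it does not attempt the multi-ROM case at all.

```python
def detect_renames(old, new):
    """Compare two versions of a dat containing single-ROM sets.

    old and new map set name -> (hash, size).
    Returns (renamed, added, removed), where renamed maps old name -> new name.
    """
    # Index the old sets by their content key (hash, size).
    old_by_key = {}
    for name, key in old.items():
        old_by_key.setdefault(key, []).append(name)

    renamed = {}
    added = []
    for name, key in new.items():
        if old.get(name) == key:
            continue  # same name, same content: unchanged
        # Old sets with identical content whose name no longer exists.
        candidates = [n for n in old_by_key.get(key, []) if n not in new]
        if len(candidates) == 1:
            renamed[candidates[0]] = name
        else:
            added.append(name)  # no match, or ambiguous: treat as new

    removed = [n for n in old if n not in new and n not in renamed]
    return renamed, added, removed
```

A possible use, with made-up set names and hashes:

```python
old = {"Game A (1990)(Pub)": ("aaa", 100), "Game C": ("ccc", 300)}
new = {"Game A (1990)(Publisher)": ("aaa", 100), "Game D": ("ddd", 400)}
renamed, added, removed = detect_renames(old, new)
```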
Adding even more community-related stuff will just make it even more complex. You can have those goals, but you need to define what the basic, important part is and what should be done later, if it is still a good idea.
My view is that the important part for me and the project right now is a way to easily browse and relate the existing information (systems, companies, publishers, groups, etc.) in the latest datfiles, so it can be checked and renamers can fix the identified errors, and if possible adding a bit of information on older releases (+ thinking of an easier WIP system for renamers one day). This is what I've been doing lately (when I had more free time; currently it urgently needs rewrites so I don't waste so much time repeating stuff, and to make it more secure).
I will not talk about any technical aspects of tools and so on; I just answered SQLite for an app because it seemed good to use with options like Qt & C++. That is not relevant for now anyway.
...and note that I didn't even talk about the problems with using datfile values. For example, for a publisher you just have a string there (the name), and there are a lot of publishers that may share the same name; this is really bad when it happens within the same system (duplicated person names, sceners, and so on). Adding details to setfiles is also complicated when you will later end up loading newer setfiles.
That's it. I hope you can get something out of this pile of text. Also, if you like I can show you what we've got now, just PM me.