Is it per DAT based, or can you run across multiple DATs? i.e. to check duplicates ACROSS DATs, if that makes sense...
You give it the filenames to read and the program does the rest. It reads all the given files and dumps the rom entries, the duplicates are found based on SHA1+CRC, and then you can check the output. It is not error prone, just a simple script with quite cryptic output, but it works. :-) And yes, it works across multiple dats. It can parse MAME as well, so you can cross check between MAME and TOSEC. You probably do not want to; this is just an extra.
The shell script part, using the Nintendo Famicom dats:
#!/bin/bash -eu
./DatFile_Converter.py dats/Nintendo*Famicom*.dat \
| tee z01 \
| sort -i \
| uniq -w 50 --all-repeated=separate \
| tee z01a \
| cut -f6 \
> z01z
z01 - raw output of the python script; includes every rom entry from all the dats you specified, just for debugging
z01a - duplicates output, seven tab-separated columns: SHA1, CRC, rom entry size, MD5, number of rom entries within the game entry, dat filename, game name/rom name
z01z - duplicates output reduced to the sixth column of z01a (the dat filename); use cut -f6- instead if you want to keep game name/rom name as well
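The sort/uniq part of the pipeline boils down to grouping lines on their first two columns (SHA1 and CRC) and keeping only the groups with more than one member. A minimal sketch of that idea in Python follows; the function name find_duplicates and the sample rows (with dummy hashes) are made up for illustration and are not part of the original script.

```python
from collections import defaultdict


def find_duplicates(lines):
    """Group tab-separated rom lines by their first two columns (SHA1, CRC).

    Returns a list of groups, each a list of lines sharing the same
    SHA1+CRC. Groups with a single member are dropped.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        fields = line.split('\t')
        if len(fields) < 2:
            continue  # not a rom line
        groups[(fields[0], fields[1])].append(line)
    return [groups[key] for key in sorted(groups) if len(groups[key]) > 1]


# Made-up rows in the script's output format (hashes are dummy values).
sample = [
    "aaaa\t1111\t32768\tm1\t1\ta.dat\tGame A/a.bin",
    "bbbb\t2222\t32768\tm2\t1\ta.dat\tGame B/b.bin",
    "aaaa\t1111\t32768\tm1\t1\tb.dat\tGame A (alt)/a2.bin",
]

for group in find_duplicates(sample):
    print('\n'.join(group))
    print()  # blank line between groups, like --all-repeated=separate
```

To run it on real data you could pass the lines of z01 (for example an open file object) instead of the sample list.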
The python part, used to parse the XML files and print the result; save it under the same filename the shell script calls (DatFile_Converter.py):
#!/usr/bin/env python3
import sys
from lxml import etree


def main():
    allroms = []
    for f in sys.argv[1:]:
        print(f'--- reading file {f}', file=sys.stderr)
        tree = etree.parse(source=f)
        print(f'-- parsing file {f}', file=sys.stderr)
        # one expression for both MAME (<machine>) and TOSEC (<game>) files
        sets = tree.xpath('//machine|game')
        print(f'-- processing file {f} with {len(sets)} sets', file=sys.stderr)
        for gameset in sets:  # renamed from "set" so the builtin is not shadowed
            setofrom = gameset.get('name')
            roms = [c for c in gameset if c.tag == 'rom']
            rominsetcount = len(roms)
            if rominsetcount > 1:  ## filter by rom/set
                continue
            for rom in roms:
                romname = rom.get('name')
                romsha1 = rom.get('sha1')
                rommd5 = rom.get('md5')
                romcrc = rom.get('crc')
                romsize = int(rom.get('size') or 0)  # a missing size counts as 0
                if romsize < 16384:  ## filter by size
                    continue
                print(f"{romsha1}\t{romcrc}\t{romsize}\t{rommd5}\t{rominsetcount}\t{f}\t{setofrom}/{romname}")
                allroms.append(rom)
    ## allroms.sort(key=lambda rom: rom.get('sha1') + rom.get('crc'), reverse=False)
    print(f"=== allroms size is {len(allroms)}", file=sys.stderr)
    if allroms:  # nothing to show when every rom entry was filtered out
        print(f"=== {allroms[0].get('sha1')}|{allroms[0].getparent().get('name')}|{allroms[0].get('name')}", file=sys.stderr)


if __name__ == '__main__':
    main()
If there is only one rom entry within a game entry and it shows up as a duplicate, you can treat it as a real duplicate. If there is more than one rom entry within a game entry, you need to check the rest yourself, as this util cannot compare game entry with game entry, only rom entry with rom entry. So there is still room for improvement, I know. Again, this is just a quick hack to learn python and XML processing, but with useful output. :-)
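One possible way to compare game entry with game entry, which the script does not do, is to build a signature for each game from the hashes of all its rom entries and then group games by that signature. The sketch below is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages, the function names game_signature and duplicate_games are hypothetical, and the two inline dats contain made-up data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def game_signature(game):
    """Return a hashable signature: the set of (sha1, crc, size) of all roms."""
    return frozenset(
        (rom.get('sha1'), rom.get('crc'), rom.get('size'))
        for rom in game.findall('rom')
    )


def duplicate_games(named_roots):
    """named_roots: iterable of (label, root element) pairs.

    Returns groups of (label, game name) pairs whose rom sets are identical,
    regardless of rom names or rom order.
    """
    groups = defaultdict(list)
    for label, root in named_roots:
        for game in root.iter():
            if game.tag not in ('game', 'machine'):
                continue
            sig = game_signature(game)
            if sig:  # skip entries without any rom
                groups[sig].append((label, game.get('name')))
    return [members for members in groups.values() if len(members) > 1]


# Made-up dats: the same two roms under different names and in different order.
DAT_A = """<datafile>
  <game name="Two Disk Game">
    <rom name="d1.bin" size="4" crc="00000001" sha1="s1"/>
    <rom name="d2.bin" size="4" crc="00000002" sha1="s2"/>
  </game>
</datafile>"""
DAT_B = """<datafile>
  <game name="Two Disk Game (copy)">
    <rom name="disk2.bin" size="4" crc="00000002" sha1="s2"/>
    <rom name="disk1.bin" size="4" crc="00000001" sha1="s1"/>
  </game>
  <game name="Other">
    <rom name="o.bin" size="4" crc="00000003" sha1="s3"/>
  </game>
</datafile>"""

roots = [('a.dat', ET.fromstring(DAT_A)), ('b.dat', ET.fromstring(DAT_B))]
for members in duplicate_games(roots):
    print(members)
```

Using a frozenset means two games match only when their rom sets are exactly the same; a game that merely contains a subset of another game's roms is not reported.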
Update: modified the python part a little. Moved the variables out of the print statement, so you can now filter the included lines not only by size but also by the number of rom entries within a game entry (now every rom entry is included which is bigger than 0 bytes and is the only rom entry in its game entry; this is quite a good starting point for finding real duplicates). If you would like to change the filter parameters, you have to change the source.
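If editing the source for every filter change gets tedious, the two filters could be turned into command line options. This is just a sketch of how that might look with argparse; the option names --min-size and --max-roms and the helper keep_rom are hypothetical and do not exist in the script above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description='Dump rom entries from dat files')
    parser.add_argument('files', nargs='+', help='dat files to read')
    parser.add_argument('--min-size', type=int, default=1,
                        help='skip rom entries smaller than this many bytes')
    parser.add_argument('--max-roms', type=int, default=1,
                        help='skip game entries with more rom entries than this')
    return parser


def keep_rom(romsize, rominsetcount, args):
    """Return True when a rom entry passes both filters."""
    return romsize >= args.min_size and rominsetcount <= args.max_roms


# Example: parse a hand-written argument list instead of sys.argv.
args = build_parser().parse_args(['--min-size', '16384', 'a.dat', 'b.dat'])
print(args.files, keep_rom(32768, 1, args), keep_rom(8192, 1, args))
```

In the real script you would call parse_args() without arguments so it reads sys.argv, loop over args.files, and use keep_rom in place of the two hard-coded if statements.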
Update: simplified the xpath. Now it can parse TOSEC, MAME and yori XML files with one xpath. As yori files are huge, the parsing is bloody slow.
Update: using a different approach to process the XML file, which gives a huge improvement in yori dat processing. Starting to move the sorting and duplicate finding logic into the python script so you do not need any external utility. Very embryonic and not working yet. Might never work. :-)
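The post does not say which approach was used for the huge files. One common option for large XML is incremental parsing, where each game entry is handled as soon as it has been read and then cleared again, instead of loading the whole tree first. The sketch below shows that idea with the standard library's xml.etree.ElementTree.iterparse (lxml offers a similar etree.iterparse); the function name iter_roms and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_roms(source):
    """Yield (set name, rom attributes) while parsing incrementally.

    source is a filename or a binary file object.
    """
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag in ('game', 'machine'):
            name = elem.get('name')
            for rom in elem.findall('rom'):
                yield name, dict(rom.attrib)
            elem.clear()  # drop the finished entry's children and attributes


# Made-up sample in MAME-like layout.
SAMPLE = b"""<mame>
  <machine name="example">
    <rom name="e1.bin" size="4096" crc="00000001" sha1="dummy1"/>
    <rom name="e2.bin" size="4096" crc="00000002" sha1="dummy2"/>
  </machine>
</mame>"""

for name, attrs in iter_roms(io.BytesIO(SAMPLE)):
    print(name, attrs['name'], attrs['size'])
```

Because iter_roms is a generator, its output can be fed straight into whatever grouping or sorting logic ends up in the python script.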