OK, so here's a recap of the work done so far on the Unicode correction project.
The first step was to fix RomVault one more (last?) time so that it handles Unicode (UTF-8) better.
That first step led to a rebuild of 7 zips I had wrong on my PC and in the first version of the PD torrent.
But some more errors were then discovered by various PD users.
So I brought the discussion to the RomVault Discord so other people could join in diagnosing and working out the corrections.
For the first batch of corrections, which led to the 2nd PD torrent release, the DATs were actually good: properly encoded in UTF-8 with the right characters.
From that point on, every problem discovered was actually in the DATs themselves, usually with Nordic & German letters replaced by garbage.
Those are in a very small number of DATs, mostly Commodore ones.
Good news is that the Discord people reverse engineered what actually happened during the DAT release process and managed to work out what the badly encoded names were in the first place:
it's Unicode (UTF-8) read as IBM code page 437 by mistake (maybe in a non-Unicode-compliant tool) and then re-encoded as Unicode.
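For anyone who wants to see the mix-up for themselves, here is a minimal Python sketch of it. This is only my illustration of the suspected mechanism, not the tool that was actually used in the DAT release process; the function name `garble` is made up for the example.

```python
def garble(name: str) -> str:
    """Simulate the suspected mistake: the UTF-8 bytes of a name
    are misread as IBM code page 437 text."""
    return name.encode("utf-8").decode("cp437")


# Each accented letter becomes two unrelated CP437 characters,
# because its UTF-8 form is two bytes long.
for ch in "äöüß":
    print(ch, "->", garble(ch))
```

Plain ASCII names come out unchanged, which would explain why only a handful of names with accented letters were hit.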
We're still working on this, but here are the results so far:
fix("┬ú", "£"); // 0x00a3
fix("┬░", "°"); // 0x00b0
fix("┬┤", "´"); // 0x00b4
fix("-¦", "´"); // 0x00b4
fix("╪", "Ø"); // 0x00d8
fix("+ÿ", "Ø"); // 0x00d8
fix("├ƒ", "ß"); // 0x00df
fix("├ñ", "ä"); // 0x00e4
fix("├Ñ", "å"); // 0x00e5
fix("├ª", "æ"); // 0x00e6
fix("≈", "ö"); // 0x00f6
fix("├╢", "ö"); // 0x00f6
fix("├╕", "ø"); // 0x00f8
fix("├║", "ú"); // 0x00fa
fix("├╝", "ü"); // 0x00fc
fix("┼é", "ł"); // 0x0142
fix("┼ô", "œ"); // 0x0153
fix("ΓÇÖ", "’"); // 0x2019
fix("ΓÇô", "–"); // 0x2013
The Ø is badly corrected in that list: both times it's the capital letter. I have let the Discord know, but everyone is asleep right now ;)
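Most of the pairs in the list can also be derived mechanically by running the mistake backwards: encode the garbled text as CP437 to get the original bytes, then decode those bytes as UTF-8. Below is a small Python sketch of that idea. It is an assumption about how one could automate it, not what RomVault or the Discord folks actually run, and `repair` is a hypothetical helper name.

```python
def repair(name: str) -> str:
    """Undo 'UTF-8 read as CP437' garbling.

    Returns the name unchanged when the round trip is not possible,
    i.e. when the name was not garbled this way.
    """
    try:
        return name.encode("cp437").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


# Made-up example name, garbled on purpose and then repaired.
garbled = "Bjørn's Demo".encode("utf-8").decode("cp437")
print(garbled, "->", repair(garbled))
```

This is all-or-nothing per name: if a name contains anything that does not fit the pattern, it is left alone. Entries such as "╪", "+ÿ", "-¦" and "≈" do not round-trip this way, so they would still need explicit fix() lines like the ones above.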
So far those errors have only been found in 6 DATs:
Commodore C64 - Demos - [PRG] (TOSEC-v2020-06-28_CM).dat
Commodore C64 - Graphics - [PRG] (TOSEC-v2020-06-28_CM).dat
Commodore C64 - Music - [D64] (TOSEC-v2020-07-03_CM).dat
Commodore C64 - Music - [PRG] (TOSEC-v2020-06-28_CM).dat
Commodore C64 - Music - [T64] (TOSEC-v2020-06-28_CM).dat
Commodore C128 - Applications - CPM - [D71] (TOSEC-v2018-09-22_CM).dat
I have the corrected versions of the DATs and, if I counted right, there were only 52 ROMs with bad names.
BUT
we have also found the same kind of problem producing characters that don't even translate to anything visible in UTF-8. These are supposedly control characters, not displayable ones.
The first one found is:
--
"Sinclair ZX Spectrum - Games - [TAP]" has DELETE [7F] character in a set, how does that even happen
ok, aside from that one, the only other weird one i've spotted is in that same dat "Petris (1996)(PTsoft)(ES)(en)[И].tap"
can't find И in the list of tosec flags
--
"Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft ].tap"
oh, that didn't work
ah yes it did, but it's invisible
it's flagged as a control character in Unicode so it probably shouldn't be there
https://www.utf8-chartable.de/
between the t and ]
--
"Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft<delete>]"
--
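Invisible characters like that DELETE are hard to spot by eye, so here is a short Python sketch for flagging them. It assumes "control character" means Unicode general category Cc; `find_control_chars` is a name I made up, and the `\x7f` in the sample is my reconstruction of the Exolon name quoted above.

```python
import unicodedata


def find_control_chars(name: str) -> list:
    """Return (index, 'U+XXXX') for every character whose Unicode
    general category is Cc (control), e.g. DELETE (U+007F)."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(name)
        if unicodedata.category(ch) == "Cc"
    ]


# Sample name with DELETE written as an escape so it is visible here.
name = "Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft\x7f].tap"
print(find_control_chars(name))
```

A visible but unexpected letter such as the И in the Petris name is not a control character, so this check would not flag it; that one still has to be compared against the TOSEC flag list by hand.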
I think we'll try to track down all those badly encoded invisible characters, and maybe some remaining weird é & à,
then go through the ISO & PIX DATs as well to be sure.
Then I'll post the corrected DATs here, in case it's hard to dir2dat them anymore.