Author Topic: Unicode/ UTF-8  (Read 194 times)

Offline Tim2460

  • Newbie
  • *
  • Posts: 27
Unicode/ UTF-8
« on: September 08, 2020, 12:01:38 AM »
While we were hunting down the last Unicode encoding errors, PD users ran into this (hopefully) last one:

The dat "Commodore C128 - Applications - CPM - [D71]" is supposed to be encoded in UTF-8:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE datafile PUBLIC "-//Logiqx//DTD ROM Management Datafile//EN" "http://www.logiqx.com/Dats/datafile.dtd">

<datafile>
    <header>
        <name>Commodore C128 - Applications - CPM - [D71]</name>
        <description>Commodore C128 - Applications - CPM - [D71] (TOSEC-v2018-09-22)</description>
        <category>TOSEC</category>
        <version>2018-09-22</version>
        <author>Duncan Twain - Tomse</author>
        <email>contact@tosecdev.org</email>
        <homepage>TOSEC</homepage>
        <url>http://www.tosecdev.org/</url>
    </header>
    <game name="BusyPack 128 ╪konomisystem (1985)(DSC)(Disk 1 of 3)[Opstartsdiskette]">
        <description>BusyPack 128 ╪konomisystem (1985)(DSC)(Disk 1 of 3)[Opstartsdiskette]</description>
        <rom name="BusyPack 128 ╪konomisystem (1985)(DSC)(Disk 1 of 3)[Opstartsdiskette].d71" size="349696" crc="9d0c67c8" md5="2079b68ff963a4676a5017e9cefc59e9" sha1="7466711da10fd2dd3ab7e13ccb16ad96feacec05"/>
    </game>

It was said in the PD thread that the ╪ is wrong and that it should be an Ø (UTF-8 encoding: 0xC3 0x98).
The ╪ has the UTF-8 encoding 0xE2 0x95 0xAA.
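
For anyone who wants to check those byte values, here is a small Python sketch. It only uses the standard codecs; the variable names are mine. It also shows one possible explanation for the mix-up, which is an assumption on my part about this particular name: the single byte 0xD8 is Ø in Latin-1 but is displayed as ╪ in IBM code page 437.

```python
# Compare the UTF-8 byte sequences of the two characters discussed above.
wrong = "╪"   # U+256A, the box-drawing character found in the dat
right = "Ø"   # U+00D8, the letter the name is said to need

wrong_utf8 = wrong.encode("utf-8")
right_utf8 = right.encode("utf-8")
print(wrong_utf8.hex(" "))   # e2 95 aa
print(right_utf8.hex(" "))   # c3 98

# The same single byte, 0xD8, interpreted in two legacy code pages.
as_cp437 = bytes([0xD8]).decode("cp437")     # box-drawing character
as_latin1 = bytes([0xD8]).decode("latin-1")  # Nordic capital letter
print(as_cp437, as_latin1)
```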

Opinions from the datters?



Offline Tim2460

  • Newbie
  • *
  • Posts: 27
Re: Unicode/ UTF-8
« Reply #1 on: September 10, 2020, 11:53:55 AM »
OK, so here's a recap of the work done so far on the Unicode correction project.
The first step was to repair RomVault one more (last?) time so that it handles the Unicode UTF-8 format better.
That first step led to a rebuild of 7 zips that I had wrong on my PC and in the first version of the PD torrent.
But various PD users then discovered more errors.
So I brought the discussion to the RomVault Discord so that other people could join the diagnosis and the correction workflow.
For the first batch of corrections, which led to the 2nd PD torrent release, the dats were actually good: correctly encoded in UTF-8, with the right characters.
From that point on, every problem discovered was in the dats themselves, usually with Nordic and German letters replaced by garbage.
These affect only a very small number of dats, mostly Commodore ones.

The good news is that the Discord people reverse-engineered what actually happened during the dat release process and managed to work out what the badly encoded names were in the first place:
it's Unicode read as IBM code page 437 by mistake (maybe in a tool that is not Unicode-compliant) and then re-encoded as Unicode.
We're still working on this, but here are the results so far:

fix("┬ú", "£");  // 0x00a3
fix("┬░", "°");  // 0x00b0
fix("┬┤", "´");  // 0x00b4
fix("-¦", "´");  // 0x00b4
fix("╪", "Ø");   // 0x00d8
fix("+ÿ", "Ø");  // 0x00d8
fix("├ƒ", "ß");  // 0x00df
fix("├ñ", "ä");  // 0x00e4
fix("├Ñ", "å");  // 0x00e5
fix("├ª", "æ");  // 0x00e6
fix("≈", "ö");   // 0x00f6
fix("├╢", "ö");  // 0x00f6
fix("├╕", "ø");  // 0x00f8
fix("├║", "ú");  // 0x00fa
fix("├╝", "ü");  // 0x00fc
fix("┼é", "ł");  // 0x0142
fix("┼ô", "œ");  // 0x0153
fix("ΓÇÖ", "’"); // 0x2019
fix("ΓÇô", "–"); // 0x2013

The Ø is corrected badly in that list: both times it's the capital letter. I have let the Discord know, but everyone is asleep right now :wink:
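
Most pairs in the list can be reproduced mechanically. This is a minimal Python sketch of the idea, not the tool used on the Discord: `unmangle` is a name I made up, and it assumes the damage is exactly "UTF-8 bytes displayed as code page 437". It encodes the garbled text back to CP437 bytes and decodes those bytes as UTF-8, leaving the name untouched if that does not work.

```python
def unmangle(text: str) -> str:
    """Undo 'UTF-8 bytes decoded as CP437' mojibake.

    Returns the text unchanged when it cannot be encoded as CP437 or when
    the resulting bytes are not valid UTF-8 (i.e. it was not this mojibake).
    """
    try:
        return text.encode("cp437").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


for garbled in ["┬ú", "├ñ", "├╢", "┼é", "ΓÇÖ", "BusyPack 128 ├ÿkonomisystem"]:
    print(repr(garbled), "->", repr(unmangle(garbled)))
```

Entries such as "+ÿ" and "-¦" look like the box-drawing character was additionally flattened to plain ASCII, and the lone "╪" is not valid UTF-8 once converted back to bytes, so this round trip leaves those unchanged. They would still need explicit pairs like the ones in the list.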

So far, those errors have only been found in 6 dats:

Commodore C64 - Demos - [PRG] (TOSEC-v2020-06-28_CM).dat
Commodore C64 - Graphics - [PRG] (TOSEC-v2020-06-28_CM).dat
Commodore C64 - Music - [D64] (TOSEC-v2020-07-03_CM).dat
Commodore C64 - Music - [PRG] (TOSEC-v2020-06-28_CM).dat
Commodore C64 - Music - [T64] (TOSEC-v2020-06-28_CM).dat
Commodore C128 - Applications - CPM - [D71] (TOSEC-v2018-09-22_CM).dat
I have the corrected versions of the dats and, if I counted correctly, there were only 52 ROMs with bad names.
BUT
we have found the same kind of problem producing characters that don't even translate to anything visible in UTF-8. These are supposedly control characters, not displayable ones.
The first one found is:
--
"Sinclair ZX Spectrum - Games - [TAP]" has DELETE [7F] character in a set, how does that even happen
ok, aside from that one, the only other weird one i've spotted is in that same dat "Petris (1996)(PTsoft)(ES)(en)[И].tap"
can't find И in the list of tosec flags
--
"Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft ].tap"
oh, that didn't work
ah yes it did but its invisible
its flagged as a control character in unicode so probably shouldn't be there https://www.utf8-chartable.de/
between the t and ]
--
"Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft<delete>]"

--
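
Invisible characters like that DELETE can be spotted automatically. Here is a minimal sketch using Python's standard `unicodedata` module; `control_chars` is a hypothetical helper, and the sample name is the Exolon one from the chat with the DELETE written out as `\x7f`. It reports every character whose Unicode general category is Cc (control).

```python
import unicodedata


def control_chars(name: str) -> list:
    """Return (index, 'U+XXXX') for each control character (category Cc) in name."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(name)
        if unicodedata.category(ch) == "Cc"
    ]


# DELETE (0x7F) sits between the "t" and the "]", as described in the chat.
name = "Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft\x7f].tap"
hits = control_chars(name)
print(hits)
```

Running this over every name in a dat would list the sets to look at. A visible but unexpected letter such as the И in the Petris name is not a control character, so it would need a separate check against the TOSEC flag list.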
I think we'll try to track down all those badly encoded invisible characters, and maybe some remaining weird é & à,
then go through the ISO & PIX dats as well to be sure.
Then I'll post the corrected dats here, in case it's hard to dir2dat them anymore.

Offline Maddog

  • Global Moderator
  • Full Member
  • *****
  • Posts: 192
Re: Unicode/ UTF-8
« Reply #2 on: September 11, 2020, 11:15:38 AM »
The recent Spectrum dats have been generated by an automated tool created by Lady Eklipse. This scrapes many sites to get its information.
I am not fully familiar with the process, but there's no human intervention AFAIK.
Therefore these weird errors in Spectrum dats can be a result of either 1) error in the tool's programming (unlikely IMHO) or 2) the names are like that on the scraped site, whichever that is.
I suppose only she can look into this issue, as her dat creation process is different from any other datter's.