Author Topic: Latest TOSEC release: Possible errors/Mismatches in data

Offline Casteele

  • Newbie
  • *
  • Posts: 12
Latest TOSEC release: Possible errors/Mismatches in data
« on: March 29, 2022, 09:54:12 AM »
I have a tool (which I wrote myself) that scans the TOSEC data and dumps possible errors, as well as names it fails to parse properly. Here is my latest output from the 2021-12-31 DAT release.

Note that most of these are mismatched/missing braces/parentheses, some are doubled-up parts, and in a few cases there are embedded double quotes in the names. At this time, the tool only compares the "name" attribute of the <game> tag with the text value of the <description> element for an _exact_ match. It does not check the name attribute of the <rom> elements.
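
For anyone who wants to reproduce this check without the tool, here is a minimal Python sketch of the same comparison. It is not the actual script; the function name `find_name_desc_mismatches` and the second sample entry are made up for illustration, and it assumes the usual <game name=...><description>...</description></game> layout.

```python
import xml.etree.ElementTree as ET


def find_name_desc_mismatches(dat_xml: str):
    """Return (name, description) pairs that are not an exact match."""
    root = ET.fromstring(dat_xml)
    mismatches = []
    for game in root.iter("game"):
        name = game.get("name", "")
        desc = game.findtext("description", default="")
        if name != desc:
            mismatches.append((name, desc))
    return mismatches


# Illustrative input: the first entry is taken from the output in this post,
# the second one is a hypothetical entry that matches exactly.
sample = """<datafile>
  <game name="Berzerk (1983)(Atari)(US)[b][CX5221]">
    <description>Berzerk (1983)(Atari)(US)[b]</description>
  </game>
  <game name="Example Game (19xx)(-)">
    <description>Example Game (19xx)(-)</description>
  </game>
</datafile>"""

for name, desc in find_name_desc_mismatches(sample):
    print(f'name :: "{name}"')
    print(f'desc :: "{desc}"')
```

Only the first entry is reported, since the comparison is a plain string equality test with no normalisation of case or whitespace.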

(It does, however, check for malformed UTF-8 encodings file-wide, but found none!)
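
A file-wide UTF-8 check of this kind can be approximated in a few lines of Python. This is only a sketch under the assumption that the whole file fits in memory; `find_invalid_utf8` is a hypothetical helper name, not part of the tool.

```python
def find_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Valid UTF-8 passes; a Latin-1 encoded "u umlaut" (0xFC) does not.
print(find_invalid_utf8("f\u00fcr".encode("utf-8")))
print(find_invalid_utf8(b"f\xfcr"))
```

In practice the bytes would come from something like `open(path, "rb").read()` for each dat file.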

Here is the output:
Code:
file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-ISO/3DO 3DO Interactive Multiplayer - Firmware (TOSEC-v2017-07-07_CM).dat"
name :: "Kanji v16.4 ROM (1994)(Panasonic)(JP)[Japanese Font]"
desc :: "Kanji 16.4 ROM (1994)(Panasonic)(JP)[Japanese Font]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-ISO/Sega Dreamcast - Games - JP (TOSEC-v2021-07-30_CM).dat"
name :: "Sakura Taisen 4 - Koi Seyo Otome v1.003 (2002)(Sega)(JP)[!][2M1, 2M3, 2M5, 2MM1]"
desc :: "Sakura Taisen 4 v1.003 (2002)(Sega)(JP)[!][2M1, 2M3, 2M5, 2MM1]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-ISO/Sega Dreamcast - Games - JP (TOSEC-v2021-07-30_CM).dat"
name :: "Sakura Taisen 4 - Koi Seyo Otome v1.003 (2002)(Sega)(JP)[!][HM223A, HM302B, HM303F]"
desc :: "Sakura Taisen 4 v1.003 (2002)(Sega)(JP)[!][HM223A, HM302B, HM303F]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-PIX/Apple Macintosh - Magazines - Maclife (TOSEC-v2019-04-25_CM).dat"
name :: "MacLife - Issue No 048 (2011-01)(Future Publishing)(GB)"
desc :: "MacLife - Issue No 048 (2011-01)(Future Publishing)(GB).pdf"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-PIX/Commodore Amiga - Magazines - AmigaWorld Tech Journal (TOSEC-v2010-01-16_CM).dat"
name :: "AmigaWorld Tech Journal - Volume 2 Number 2 (1992-04)(IDG Communications)(US)"
desc :: "AmigaWorld Tech Journale - Volume 2 Number 2 (1992-04)(IDG Communications)(US)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-PIX/Commodore Amiga - Manuals - Hardware (TOSEC-v2014-02-01_CM).dat"
name :: "GVP Impact SCSI Controller Users Guide (1988)(GVP)"
desc :: "GVP Impact SCSI Controller Users Guide (1988)(GVP)(A500-A2000)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-PIX/Exidy Sorcerer - Books (TOSEC-v2010-01-02_CM).dat"
name :: "Guided Tour Of Personal Computing, A (1979-02)(Exidy)"
desc :: "Guided Tour Of Personal Computing, A (1979-02)(Exid)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-PIX/Franklin ACE 500 - Manuals - Hardware (TOSEC-v2012-04-15_CM).dat"
name :: "Franklin ACE 500 - User's Reference Manual (1986)(Franklin Computer)"
desc :: "Franklin Ace 500 - User's Reference Manual (1986)(Franklin Computer)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-PIX/Jupiter Cantab Jupiter Ace - Magazines - Ace User (TOSEC-v2010-01-15_CM).dat"
name :: "Jupiter Ace Users Group Introduction (1984)(Remsoft)(GB)"
desc :: "Jupiter Ace Users Group Introduction (1984)(Remsoft)(GB)o"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-PIX/Multi-format - Magazines - Atomix (MX) (TOSEC-v2010-01-23_CM).dat"
name :: "Atomix - Issue 69 (2005-11)(Limit X Media)(MX)"
desc :: "Atomix Magazine - Issue 69 (2005-11)(Limit X Media)(MX)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC-PIX/Multi-format - Magazines - Atomix (MX) (TOSEC-v2010-01-23_CM).dat"
name :: "Atomix - Issue 74 (2006-04)(Limit X Media)(MX)"
desc :: "Atomix Magazine - Issue 74 (2006-04)(Limit X Media)(MX)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Applications - [DSK] (TOSEC-v2021-12-11_CM).dat"
name :: "Boulder Dash Construction Kit (1987)(Epyx)(II+)[cr 4am][earlier crack]"
desc :: "Boulder Dash Construction Kit (1987)(Epyx)(II+)[48K][cr 4am][earlier crack]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Applications - [DSK] (TOSEC-v2021-12-11_CM).dat"
name :: "Invoice Factory, The v1.5 (1981)(Micro Lab)(US)(Disk 2 of 2)(Report Program)[cr 4am]"
desc :: "Invoice Factory, The v1.5 (1981)(Micro Lab)(US)Disk 2 of 2)(Report Program)[cr 4am]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Applications - [PO] (TOSEC-v2021-12-11_CM).dat"
name :: "PC Transporter v1.30 (1988-09-07)(Applied Engineering)(Disk 1 of 2 Side A)[AEPCT][req PC Transporter card]"
desc :: "ZPC Transporter v1.30 (1988-09-07)(Applied Engineering)(Disk 1 of 2 Side A)[AEPCT][req PC Transporter card]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Compilations - Games - [2MG] (TOSEC-v2021-07-25_CM).dat"
name :: "Total Replay v2.0 (2019-09-06)(4am)(MIT)(beta)[beta 1]"
desc :: "Total Replay v2.0(2019-09-06)(4am)(MIT)(beta)[beta 1]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Compilations - Games - [DO] (TOSEC-v2021-02-12_CM).dat"
name :: "Taxman (1981)(H.A.L. Labs) amp; Outpost (1981)(Sirius Software)[48K] & Star Blaster (1981)(Piccadilly Software)[cr Mr. Xerox] & Star Wars (1979)(Brown, Donald) & Apple-oids (1981)(California Pacific Computer)"
desc :: "Taxman (1981)(H.A.L. Labs) & Outpost (1981)(Sirius Software)[48K] & Star Blaster (1981)(Piccadilly Software)[cr Mr. Xerox] & Star Wars (1979)(Brown, Donald) & Apple-oids (1981)(California Pacific Computer)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Compilations - Various - [DSK] (TOSEC-v2021-12-11_CM).dat"
name :: "Autobahn (1981)(Sirius Software) & Class Manager (19xx)(-) & Dalton Disk Disintegrator v2.0 (1985)(Dalton) & Softporn Adventure (198x)(Latenight Software)"
desc :: "\"Autobahn (1981)(Sirius Software) & Class Manager (19xx)(-) & Dalton Disk Disintegrator v2.0 (1985)(Dalton) & Softporn Adventure (198x)(Latenight Software)\""

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Educational - [DO] (TOSEC-v2021-12-11_CM).dat"
name :: "Wizard of Id's WizType, The (1984)(Sierra On-Line)[cr Disk Jockey][a]"
desc :: "Wizard of Id's WizType, The(1984)(Sierra On-Line)[cr Disk Jockey][a]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Educational - [WOZ] (TOSEC-v2021-12-11_CM).dat"
name :: "Addition Logician (1984)(MECC)(II+)(US)[a][A-125]"
desc :: "Addition Logician (1984)(II+)(MECC)(US)[a][A-125]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Educational - [WOZ] (TOSEC-v2021-12-11_CM).dat"
name :: "Business Vol. 3 - Accounting v1.8 (1981)(MECC)(II+)(US)[48K][A-721]"
desc :: "Business Vol. 3 - Accounting v1.8 (1981)(MECC)(II+)[48K](US)[A-721]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DO] (TOSEC-v2021-12-11_CM).dat"
name :: "Eamon Adventure #002 - The Minotaur's Lair (1984-12-15)(Brown, Donald)(US)[f converted to STD][Eamon Adventurer's Guild]"
desc :: "Eamon Adventure #002 - The Minotaur's Lair (1984-12-15)(Brown, Donald)(US)[f converted to STD][no boot][Eamon Adventurer's Guild]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DO] (TOSEC-v2021-12-11_CM).dat"
name :: "Eamon Adventure #046 - Life Quest (1985-05-15)(Crawford, David)(US)[Eamon Adventurer's Guild]"
desc :: "Eamon Adventure #046 - Lifequest (1985-05-15)(Crawford, David)(US)[Eamon Adventurer's Guild]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DO] (TOSEC-v2021-12-11_CM).dat"
name :: "Eamon Adventure #065 - Swordquest (1984-07-15)(Pender, Roger)(US)[Computer Learning Center Library]"
desc :: "Eamon Adventure #065 - The School of Death (1984-07-15)(Pender, Roger)(US)[Computer Learning Center Library]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DO] (TOSEC-v2021-12-11_CM).dat"
name :: "Eamon Adventure #117 - Dungeon of Doom v1.8 (1986-08-26)(Knezek, Dan)(US)[Eamon Adventurer's Guild][req 80-col card]"
desc :: "Eamon Adventure #117 - Dungeon of Doom v1.8 (19xx)(Knezek, Dan)(US)[Eamon Adventurer's Guild][req 80-col card]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DO] (TOSEC-v2021-12-11_CM).dat"
name :: "Eamon Adventure #157 - Pathetic Hideout of Mr. Roessler (1988-10-11)(Segerlind, Nathan)(US)[Eamon Adventurer's Guild]"
desc :: "Eamon Adventure #157 - Pathetic Hideout of Mr. R. (1988-10-11)(Segerlind, Nathan)(US)[Eamon Adventurer's Guild]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DO] (TOSEC-v2021-12-11_CM).dat"
name :: "Eamon Adventure #160 - Monty Python and the Holy Grail (1988-08-29)(Segerlind, Nathan)(US)[Eamon Adventurer's Guild]"
desc :: "Eamon Adventure #160 - Monty Python & Holy Grail (1988-08-29)(Segerlind, Nathan)(US)[Eamon Adventurer's Guild]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DO] (TOSEC-v2021-12-11_CM).dat"
name :: "Eamon Adventure #181 - The Eamon Sewer System (1989-07-05)(Parker, Robert)(US)[Eamon Adventurer's Guild]"
desc :: "Eamon Adventure #181 - The Eamon Sewer System (1989-07-05)(Parker, Robert)(US)[Eamon Adventurer's Guild]\""

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DSK] (TOSEC-v2021-12-11_CM).dat"
name :: "Batman - The Caped Crusader (1988)(Data East)(IIE)[cr 4am][work disk]"
desc :: "Batman - The Caped Crusader (1988)(Data East)(IIE)(1988)(Data East)[cr 4am][work disk]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DSK] (TOSEC-v2021-12-11_CM).dat"
name :: "SunDog - Frozen Legacy v2.0 (1984)(Accolade)(Disk 2 of 2)[cr Nut Cracker]"
desc :: "SunDog - Frozen Legacy v2.0 (1984)(Accolade)(Disk 2 of 2`)[cr Nut Cracker]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Atari 5200 - Games (TOSEC-v2021-07-25_CM).dat"
name :: "Berzerk (1983)(Atari)(US)[b][CX5221]"
desc :: "Berzerk (1983)(Atari)(US)[b]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Convergent Technologies AWS NGEN Workstation - Applications - [IMA] (TOSEC-v2017-10-23_CM).dat"
name :: "CTOS Ethernet Packet Driver v3.7.0 (19xx)(-)[B25-PDL]"
desc :: "CTOS Ethernet Packet Driver v3.7.0(19xx)(-)[B25-PDL]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Ferguson Big Board II - Collections - User Disk (TOSEC-v2017-04-05_CM).dat"
name :: "Big Board User Disk #05 (19xx)(Micro Cornucopia)"
desc :: "Big Board User Disk #5 (19xx)(Micro Cornucopia)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Funtech Super A'can - Games (TOSEC-v2021-07-25_CM).dat"
name :: "Super Taiwanese Baseball League ~ Chao Ji Zhong Hua Zhi Bang Lian Meng (1995)(C&E Soft)(TW)[F005]"
desc :: "Super Taiwanese Baseball League ~ Chao Ji Zhong Hua Zhi Bang Lian Meng (1995)(C&ESoft)(TW)[F005]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/NEC PC-Engine & TurboGrafx-16 - Compilations (TOSEC-v2018-06-09_CM).dat"
name :: "15 in 1 Mega-Collection - Backtracking Ten Years (1992)(Image)(FI)(en)"
desc :: "15 in 1 Mega-Collection - Backtracking Ten Years (1992)(Image)(FI)(en).pce"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/NEC PC-Engine & TurboGrafx-16 - Games (TOSEC-v2021-07-25_CM).dat"
name :: "Cyber Cross (1989)(Face)(JP)[AKA Busou Keiji - Cyber Cross]"
desc :: "Cyber Cross (1989)(Face)(JP)[AKA Busou Keiji Cyber Cross]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Sega Mega Drive & Genesis - Games - [BIN] (TOSEC-v2021-07-25_CM).dat"
name :: "Dr. Robotnik's Mean Bean Machine (1993-09)(Sega)(US)(proto)[h Premiere]"
desc :: "Dr. Robotnik's Mean Bean Machine (1993-09)(Sega)(US)(proto)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Tangerine Oric-1 & Oric Atmos - Educational - [TAP] (TOSEC-v2021-12-11_CM).dat"
name :: "Vers La Lecture - Tests de Reconnaissance Visuelle (19xx)(-)(Atmos)(fr)"
desc :: "Vers La Lecture (19xx)(-)(Atmos)(fr)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Tangerine Oric-1 & Oric Atmos - Games - [DSK] (TOSEC-v2021-12-11_CM).dat"
name :: "Ghost Gobbler (19xx)(IJK Software)(FR)"
desc :: "Pac-man (19xx)(Christophe Devalland)(fr)[Sedoric]"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Tangerine Oric-1 & Oric Atmos - Games - [DSK] (TOSEC-v2021-12-11_CM).dat"
name :: "Tetris GB (1992)(JCB Techniques)"
desc :: "Tetris GB (1992)(JCB Techniques)(PD)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Tangerine Oric-1 & Oric Atmos - Games - [TAP] (TOSEC-v2021-12-11_CM).dat"
name :: "Hopper (1983)(Personal Software Services)(GB)"
desc :: "Hopper (19xx)(Personal Software Services)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Tangerine Oric-1 & Oric Atmos - Games - [TAP] (TOSEC-v2021-12-11_CM).dat"
name :: "Plouf (19xx)(Larcher, Dominique)(fr)"
desc :: "Plouf (19xx)(Larcher, Dominique)(fr)(PD)"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Tangerine Oric-1 & Oric Atmos - Games - [TAP] (TOSEC-v2021-12-11_CM).dat"
name :: "Secret du Tombeau, Le (1985-06)(Loriciels)(FR)[AKA Tombeau d'Axayacatl, Le]"
desc :: "Secret du Tombeau, Le (1985-06)(Loriciels)(FR)[AKA Tombeau d'Axayacatl, Le"

file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Tangerine Oric-1 & Oric Atmos - Games - [TAP] (TOSEC-v2021-12-11_CM).dat"
name :: "Slalom (19xx)(-)(Atmos)(FR)(Atmos)(FR)"
desc :: "Slalom (19xx)(-)(Atmos)(FR)"



Offline Casteele

Re: Latest TOSEC release: Possible errors/Mismatches in data
« Reply #1 on: March 29, 2022, 02:14:33 PM »
Small (well, sort of...) update:

I fixed the name checking between the <game> and <rom> elements and got it working, with and without file extension matching. The results:

Entries which include the file extension in the <game name=...> or <description>, -OR- are missing the file extension in the <rom name=...>:
20
I can post the list here if desired. Many of them can be fixed or guessed (for example, entries in the "... [BIN]" dats are obviously meant to be .bin files).
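
As a rough illustration of that kind of guess, the Python sketch below pulls a trailing "[XXX]" tag out of a dat file name and turns it into an extension. The assumption that the bracketed tag always equals the intended extension is mine, and `infer_extension` is a hypothetical name.

```python
import re


def infer_extension(dat_filename: str):
    """Guess a ROM extension from a "[XXX]" tag just before "(TOSEC-v...)"."""
    match = re.search(r"- \[([A-Za-z0-9]+)\] \(TOSEC-v", dat_filename)
    return "." + match.group(1).lower() if match else None


print(infer_extension(
    "Sega Mega Drive & Genesis - Games - [BIN] (TOSEC-v2021-07-25_CM).dat"))
print(infer_extension("VTech Genius - Educational (TOSEC-v2018-07-01_CM).dat"))
```

Dats without such a tag return None, so those entries would still need to be checked by hand.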

Entries in which the <game name=...> is _DIFFERENT_ from the <rom name=...>, regardless of extension:
236797
There is no way I will be posting that many errors here... That is too many to even summarize how _much_ they differ -- although many of the last couple dozen listed are simply cases where something like "(Track x of y)" has been added to either the <game> or <rom> name.



Offline mictlantecuhtle

  • Global Moderator
  • Full Member
  • *****
  • Posts: 146
Re: Latest TOSEC release: Possible errors/Mismatches in data
« Reply #2 on: March 29, 2022, 08:17:20 PM »
Hey, thanks for this, looks like a really valuable piece of work! Could you please post the smaller list for sure and I'll take a look and fix those entries for next release?

If you're able to upload a copy of the output of your tool for the (much) larger list somewhere that would also be of interest. Certainly won't be for next release but I'd like to see if I can find some way to meaningfully parse it and see if we can dig out places where there's a genuine error, rather than the rom / game names differing because of the way the convention works.

Offline Casteele

Re: Latest TOSEC release: Possible errors/Mismatches in data
« Reply #3 on: March 30, 2022, 01:00:34 AM »
Here is the smaller list, manually sorted by type. Note there are actually 30 items; my previous count was taken _after_ I had already fixed 10 of them in my dat files.

Code:
*** Has extension added in <game name=...> and/or <description>

  file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/NEC PC-Engine & TurboGrafx-16 - Compilations (TOSEC-v2018-06-09_CM).dat"
  game :: "15 in 1 Mega-Collection - Backtracking Ten Years (1992)(Image)(FI)(en)[h Pure-Byte][AKA 15 in 1 Pure-Collection].pce"
  game :: "15 in 1 Mega-Collection - Backtracking Ten Years (1992)(Image)(FI)(en)[u].pce"
  game :: "15 in 1 Mega-Collection - Backtracking Ten Years v2.0 (1992)(Image)(FI)(en).pce"

*** Missing extensions from <rom> names -- Can be inferred from the dat file, however.

  file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Compilations - Applications - [DO] (TOSEC-v2021-02-12_CM).dat"
  game :: "Various Catdialers (19xx)(-)"

  file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Apple II - Games - [DO] (TOSEC-v2021-12-11_CM).dat"
  game :: "Eamon Adventure #031 - The Gauntlet (19xx)(Nelson, John)(US)[Computer Learning Center Library]"

  file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Sega Mega Drive & Genesis - Games - [BIN] (TOSEC-v2021-07-25_CM).dat"
  game :: "Amy Rose in Sonic the Hedgehog v2.1 (2016-06-15)(E-122-Psi)[h Sonic the Hedgehog]"
  game :: "Herzog Zwei - Bloodbath Edition v1.0 (2008-05-04)(Eisfrei)[h Herzog Zwei]"
  game :: "Rent a Hero (1991-09-20)(Sega)(JP)[tr en NikcDC][v0.97 Genesis]"
  game :: "Streets of Rage 2 (1992-12)(Sega)(US)[f checksum][h add Ristar Metal64][v1.1]"

*** Missing extensions from <rom> names -- Can be inferred from other similar items nearby.

  file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/IBM Displaywriter - Educational (TOSEC-v2021-12-11_CM).dat"
  game :: "IBM Displaywriter System Training - Textpack E and 2 (198x)(IBM)[id S544-2271-0]"

*** Missing extensions -- Unable to infer anything from the immediate information at hand.

  file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/VTech Genius - Educational (TOSEC-v2018-07-01_CM).dat"
  game :: "Allgemeinwissen (200x)(Video Technology)(8008 CX)(DE)"
  game :: "Diktate (200x)(Video Technology)(Leader Power)(DE)"
  game :: "Englisch fur Anfanger (200x)(Video Technology)(DE)"
  game :: "Franzosisch Total (200x)(Video Technology)(Leader Notebook)(DE)"
  game :: "Lander-Menschen-Umwelt (200x)(Video Technology)(DE)"
  game :: "Schreibmaschinenkurs (200x)(Video Technology)(8008 CX)(DE)"
  game :: "Sport-Wissen-Geschichte (200x)(Video Technology)(8008 CX)(DE)"
  game :: "Sport-Wissen-Geschichte (200x)(Video Technology)(Leader - Power - Notebook Plus)(DE)"
  game :: "Wortspiele (200x)(Video Technology)(Junior Redstar - Redstar 2)(DE)"

  file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/VTech Genius Leader 2000 - Educational (TOSEC-v2018-07-01_CM).dat"
  game :: "Beruhmte Orte und Leute (199x)(Video Technology)(DE)"
  game :: "Englisch fur Anfanger (199x)(Video Technology)(DE)"
  game :: "Englisch fur Fortgeschrittene (199x)(Video Technology)(DE)"
  game :: "Franzosisch für Anfanger (199x)(Video Technology)(DE)"
  game :: "Schreibmaschinenkurs (199x)(Video Technology)(DE)"
  game :: "Super Naturwissenschaften (199x)(Video Technology)(DE)"

  file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/VTech Genius Leader 6000SL - Educational (TOSEC-v2018-07-01_CM).dat"
  game :: "Tabellenkalkulation (199x)(Video Technology)(DE)"

  file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/VTech Genius Leader Color - Educational (TOSEC-v2018-07-01_CM).dat"
  game :: "Carlchen Clever - Wort- und Zahlenlabyrinth (199x)(Video Technology)(DE)"
  game :: "Fling - Die Buchstaben-Schleuder (199x)(Video Technology)(DE)"
  game :: "Tatort Umwelt (199x)(Video Technology)(DE)"
  game :: "Zahlenknacker (199x)(Video Technology)(DE)"

Regarding the bigger list... I can upload it, or I can upload the tool itself. It is a Tcl script (always in plain text "source" form); Tcl comes included on Linux and Macintosh, and can be downloaded from many places on the web for Windows.

It might be better to wait, however, for two things:

1) I finish the parser for parsing the TNC-style names; and
2) I re-write the naming detection code to better classify the type of error.

Currently, the detection code uses a simple method:

It compares the text from <game name=...> to the text from <rom name=...>. If they are equal, there are only two possible reasons:

1) The <game> name also includes the extension; Or,
2) The <rom> name does _not_ include an extension.

That is, <game name="some game (19xx)(-).ext">...<rom name="some game (19xx)(-).ext"></game> _OR_ <game name="some game (19xx)(-)">...<rom name="some game (19xx)(-)"></game>.

In those cases, it immediately flags the error and returns.

To be more technical, the code only compares the text up to the _length_ of the game name text, because I observed that "game name (year)(pub)" and "game name (year)(pub).ext" should _always_ differ only by the ".ext" part; the part before that should always be equal.

Thus, some errors might be cases like: <game name="game v1.0 (year)(pub)">...<rom name="game v1.1 (year)(pub)(comment).ext"></game>, where: the version was corrected and "(comment)" was added.
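
The detection method described above could be sketched in Python roughly like this. It is not the Tcl script itself; the function name `classify` and the labels it returns are hypothetical.

```python
def classify(game_name: str, rom_name: str) -> str:
    """Classify a <game name>/<rom name> pair using the simple method above."""
    if game_name == rom_name:
        # Either the game name carries an extension, or the rom name lacks one.
        return "extension problem"
    # Compare only up to the length of the game name; the rest should be ".ext".
    if rom_name.startswith(game_name):
        return "ok"
    return "different"


pairs = [
    ("some game (19xx)(-)", "some game (19xx)(-).ext"),
    ("some game (19xx)(-).ext", "some game (19xx)(-).ext"),
    ("some game (19xx)(-)", "some game (19xx)(-)"),
    ("game v1.0 (year)(pub)", "game v1.1 (year)(pub)(comment).ext"),
]
for game, rom in pairs:
    print(classify(game, rom), "::", game, "->", rom)
```

Note that a prefix-only comparison says nothing about what follows the game name, which is one reason the classification needs to be reworked.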

In such cases, by completing the TNC parser, I can detect where information was added, and automatically add it as needed under the presumption that adding info causes no _loss_ of info, even if the info added is incorrect. Incorrect info can always be corrected later.

Cases where info is different/corrected or removed will still need to be handled by humans, however, to prevent loss of info.
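
To give an idea of what "detecting added info" might look like, here is a very rough Python sketch. It is not the planned TNC parser: it only splits a name into a title plus its "(...)" and "[...]" groups, ignores their order and any stray text between them, and the names `parse_name` and `added_tokens` are hypothetical.

```python
import re

# One "(...)" or "[...]" group with no nested brackets of the same kind.
TOKEN = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")


def parse_name(name: str):
    """Split a name into (title, list of "(...)" and "[...]" groups)."""
    first = TOKEN.search(name)
    if first is None:
        return name.strip(), []
    title = name[:first.start()].strip()
    return title, TOKEN.findall(name[first.start():])


def added_tokens(old: str, new: str):
    """Return groups only present in `new`, or None if anything else changed."""
    old_title, old_tokens = parse_name(old)
    new_title, new_tokens = parse_name(new)
    if old_title != new_title:
        return None  # title or version differs: needs a human
    remaining = list(new_tokens)
    for token in old_tokens:
        if token not in remaining:
            return None  # something was removed or altered: needs a human
        remaining.remove(token)
    return remaining


print(added_tokens("Tetris GB (1992)(JCB Techniques)",
                   "Tetris GB (1992)(JCB Techniques)(PD)"))
print(added_tokens("game v1.0 (year)(pub)",
                   "game v1.1 (year)(pub)(comment)"))
```

A result of None means the difference is not a pure addition, which matches the cases that should be left for manual review.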

Offline Casteele

Re: Latest TOSEC release: Possible errors/Mismatches in data
« Reply #4 on: March 30, 2022, 01:14:49 AM »
Also note that I have already corrected the entries from my original first post in my dat files, except one:

Code:
file :: "TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Tangerine Oric-1 & Oric Atmos - Games - [DSK] (TOSEC-v2021-12-11_CM).dat"
name :: "Ghost Gobbler (19xx)(IJK Software)(FR)"
desc :: "Pac-man (19xx)(Christophe Devalland)(fr)[Sedoric]"

I am not familiar enough with that system to know if the correct name is "Ghost Gobbler" or "Pac-man" (or to verify the publisher/author). Thus, in my dat sets, I simply copied the entry, so one copy has one name and the other copy has the other name. If anyone can let me know for certain, please do.

Offline Casteele

Re: Latest TOSEC release: Possible errors/Mismatches in data
« Reply #5 on: March 30, 2022, 10:54:39 AM »
One more "error" found. Previously, I was only checking for "UTF-8" "correctness". I expanded that to do some additional checks, and found:

In file:
   TOSEC - DAT Pack - Complete (3312) (TOSEC-v2021-12-31)/TOSEC/Sinclair ZX Spectrum - Games - [TAP] (TOSEC-v2021-01-15_CM).dat


Lines 23613-23615:
Code:
<game name="Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft\x7F]">
<description>Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft\x7F]</description>
<rom name="Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft\x7F].tap" size="45600" crc="5bc50634" md5="d9b9e1556f3cea67852b8625dabbf7d2" sha1="6bf4cf385f1a5dd418b8f141d8b15f7dde309ce4"/>

There is an embedded \x7F in the name, near the end (as marked in the copy/paste, above).
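
A check for stray control characters like this one can be sketched in Python as below. The exact set of characters to flag is my assumption (C0 controls other than tab, LF and CR, plus DEL), and `find_control_chars` is a hypothetical name.

```python
import re

# C0 control characters except tab, LF and CR, plus DEL (0x7F).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def find_control_chars(text: str):
    """Return (offset, hex code) for every control character in `text`."""
    return [(m.start(), hex(ord(m.group())))
            for m in CONTROL_CHARS.finditer(text)]


name = "Exolon (1987)(Hewson Consultants)(48K-128K)[h Nmi-Soft\x7f]"
print(find_control_chars(name))
```

Running this over every name attribute and <description> text would list the offset of each offending character.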

Also note that some of the <?xml ...?> declarations lack an "encoding=..." attribute, and quite a few files have a Byte Order Mark (BOM). UTF-8 files _should not_ use a BOM (although it is considered an acceptable evil due to broken/buggy UTF-8 text editors), because the BOM was meant for UTF-16 encodings -- UTF-8 is always a series of single 8-bit bytes, so byte order is meaningless.
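
Both conditions are easy to test for on the raw bytes of a file. The Python sketch below is one possible way to do it; `inspect_header` is a hypothetical name, and the regular expression is a simplification that assumes the declaration, if present, is the first thing after any BOM.

```python
import codecs
import re


def inspect_header(data: bytes):
    """Report whether a file starts with a UTF-8 BOM and declares an encoding."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    decl = re.match(rb"<\?xml[^>]*\?>", body)
    has_encoding = bool(decl and re.search(rb"encoding\s*=", decl.group()))
    return {"bom": has_bom, "encoding_declared": has_encoding}


print(inspect_header(b'\xef\xbb\xbf<?xml version="1.0"?><datafile/>'))
print(inspect_header(b'<?xml version="1.0" encoding="UTF-8"?><datafile/>'))
```

Only the first few hundred bytes of each dat file need to be read for this.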

My recommendation is actually to avoid even UTF-8 _directly_, and to use it _indirectly_ via either XML character entities/encoding or URI %XX encodings. Keep all data files plain ASCII. But I realize that may break some/many tools that read the TOSEC database and have not been updated or maintained for many years. :-/
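
As an illustration of the XML character reference option, Python can produce ASCII-only text with its built-in "xmlcharrefreplace" error handler. This is just a sketch; `to_ascii_xml` is a hypothetical name, and the sample string is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ascii_xml(text: str) -> str:
    """Escape &, < and >, then turn non-ASCII characters into &#NNN; references."""
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


original = "C&E Franz\u00f6sisch f\u00fcr Anf\u00e4nger"
encoded = to_ascii_xml(original)
print(encoded)

# Any conforming XML parser turns the references back into the original text.
assert ET.fromstring(f"<description>{encoded}</description>").text == original
```

The stored file stays plain ASCII, while tools that use a real XML parser still see the original characters.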