Interesting that they manually transcribed the data into Excel. It would also be interesting to know how they mapped the Excel files to the final dataset. I wonder whether LLMs could convert the scans to structured data more efficiently, and how much accuracy would be lost in the process.
The other day I came across this pricing dataset: https://oria-data.trillianthealth.com/ (it only covers pricing, though).
There must be more gem datasets like these. I wish I had the time (and expertise) to explore them.
https://canmod.net/digitization/