Wiki Content Extractor Idea

     #=             
     ###=           
     #####=         
.    #######=       
#*:  #########=     
###*:###########=   
##################= 
######*+*##*++#####*
#####====++====#####
#####+========+#####
=#####*+====+*#####+
 .+#####*++*#####+. 
   .+##########+.

In the interest of bringing more content to Gemini, mostly for my own use, I've thought about building a scraper for Wikia (now called Fandom) content. This content is all Creative Commons licensed, so scraping/archiving it is completely within license (IANAL).

I played around with this idea a bit this morning and think I have something that's fairly decent. It's only a proof of concept at this point, but this is what I've been testing out.

First, a page is downloaded. For example, this is a page about Sonic the Hedgehog:

Sonic the Hedgehog
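
A minimal sketch of the download step in Go, using only the standard library. The URL pattern (https://<wiki>.fandom.com/wiki/<Title>) and the helper names pageURL and fetch are assumptions made up for illustration, not something Fandom documents or that the proof of concept necessarily uses.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strings"
	"time"
)

// pageURL builds the address of an article on a Fandom wiki.
// Assumption: wikis live at https://<wiki>.fandom.com/wiki/<Title>,
// with spaces in the title written as underscores.
func pageURL(wiki, title string) string {
	u := url.URL{
		Scheme: "https",
		Host:   wiki + ".fandom.com",
		Path:   "/wiki/" + strings.ReplaceAll(title, " ", "_"),
	}
	return u.String()
}

// fetch downloads addr and returns the raw HTML.
func fetch(client *http.Client, addr string) ([]byte, error) {
	resp, err := client.Get(addr)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetching %s: %s", addr, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: fetchpage <wiki> <title>")
		os.Exit(2)
	}
	client := &http.Client{Timeout: 30 * time.Second}
	body, err := fetch(client, pageURL(os.Args[1], os.Args[2]))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	os.Stdout.Write(body)
}
```

Run it with a wiki name and a title, for example `go run . sonic "Sonic the Hedgehog"`, and redirect the output to a file for the next step.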

Converting to gemtext

The next step would be to convert this into gemtext. But unfortunately the HTML has a lot of clutter, so simply piping this into html2gmi results in an unreadable page.

So instead what I have tried is using an "article extractor". If you've ever used a service like Pocket or the reader mode in your web browser, those work by extracting just the content of an article and dropping everything around it.

In Go there's a library called ce that does this.

I tested it out and it works well. I'm sure there are more out there.

This gives you fairly clean HTML. Pipe that into html2gmi and the gemtext comes out fairly clean too! This is the html2gmi program I used:

html2gmi

It's a CLI but there's a Go library as well. This takes the HTML and removes all of the parts that gemtext doesn't support, giving you a fairly clean gemtext file.

Adding images

But there's more! While the generated gemtext is quite reasonable, it's also fairly plain. We can give it some pizazz, I think. The article extractor includes images, and you can turn those into external links with html2gmi. Even cooler would be to turn those into inline ASCII art!

image2ascii

image2ascii is a library that takes an image and converts it into ASCII art. It's quite good, actually.

Next steps

So, the full picture:

Download a page
Extract just the content (the article)
Download and convert images to ASCII
Convert the full thing into gemtext

Given these primitives I think we have a recipe for a fairly decent Wikia/Fandom article to gemtext converter. Making this useful will take a bit more work; you'd probably want to build a crawler around this and some mechanism to keep it all up to date (probably using the Special:RecentChanges page). I'm going to continue playing around with this idea and report back.

Updates

2022-05-23

Discovered that Fandom supports the same api.php route that Wikipedia does, so there's possibly some easier HTML to parse:

Example API response
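
A sketch of what using that route could look like. It relies on the standard MediaWiki action=parse request with prop=text and formatversion=2, which returns the rendered HTML as a plain JSON string. That Fandom serves api.php at https://<wiki>.fandom.com/api.php with these options is an assumption here, and parseURL and articleHTML are names invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// parseURL builds an api.php request for a page's rendered HTML.
// Assumption: the wiki exposes the API at https://<wiki>.fandom.com/api.php.
func parseURL(wiki, title string) string {
	q := url.Values{}
	q.Set("action", "parse")
	q.Set("page", title)
	q.Set("prop", "text")
	q.Set("format", "json")
	q.Set("formatversion", "2")
	return "https://" + wiki + ".fandom.com/api.php?" + q.Encode()
}

// parseResponse holds the parts of an action=parse reply used here.
type parseResponse struct {
	Parse struct {
		Title string `json:"title"`
		Text  string `json:"text"`
	} `json:"parse"`
}

// articleHTML pulls the title and rendered HTML out of a response body.
func articleHTML(body []byte) (string, string, error) {
	var r parseResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", "", err
	}
	return r.Parse.Title, r.Parse.Text, nil
}

func main() {
	fmt.Println(parseURL("sonic", "Sonic the Hedgehog"))

	// A hand-written stand-in for a real response body.
	sample := []byte(`{"parse":{"title":"Sonic the Hedgehog","text":"<p>Example</p>"}}`)
	title, html, err := articleHTML(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(title)
	fmt.Println(html)
}
```

The HTML that comes back this way is just the article body, so it may be possible to feed it to html2gmi without the extractor step, though I haven't verified how clean it is.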