I finally got my hands on a copy of the DVDs that educ.ar released using CDPedia, the free software project that fits as much of the Spanish Wikipedia as possible onto a disc.
Thanks to the help of Jimmy Wales and Martín Varsavsky, these 8.5 GB DVDs are being distributed to every school in Argentina. That is a great thing, because they contain 100% of the articles in the Spanish Wikipedia and all of the accompanying images (some at reduced size). It is also worth noting that they may contain any errors or vandalism that were present in Wikipedia at the time each page was downloaded.
This post is to celebrate the project coming full circle, so I’m going to tell the story of how it was completed, and I’m going to finish with some ideas for the next stage.
The story of the project
Our CDPedia project was born around 2006 with the idea of getting the knowledge accumulated on Wikipedia to the most remote schools in the country: those where there is no Internet.
Some members of the Python Argentina user group got together for a few impromptu sprints after free software events and worked on improving the source code. Like every amateur software project that is just starting up, CDPedia was developed in the free time of its contributors, and we released a first version, 0.5, in 2009, during our first PyCon Argentina conference, which took place just a week after Wikimania in Buenos Aires.
Something unusual happened at the end of 2009: I got an email from Jimmy Wales, who was coming to Buenos Aires the following month and wanted to find out more about our project. We met him and showed him the 0.6 version we were working on, and he praised it because it looked so much like the online Wikipedia. That same day I had a meeting with him and some members of educ.ar, the educational web portal of the Argentine state, which, through a grant from the Varsavsky Foundation, would be able to publish a dual-layer DVD with a CDPedia edition.
During that meeting we talked about what was needed to achieve an interesting result. We had to test CDPedia on many “relatively old” computers, such as the PCs that you can find throughout the country. There were also many bugs to fix. Most importantly of all, due to technical issues the Wikimedia Foundation had stopped publishing the “html dumps” that our project used as a base. The final one was from June 2008, and we all agreed that it made no sense to publish a DVD in 2010 with content that old.
That’s how we managed to get financing for two interns from PyAr to work part-time for three months, fixing bugs and polishing every technical detail. But after trying through several channels and in various ways, we were not able to get an up-to-date html dump similar to the one we were using, so we ended up using the hours assigned to the interns to build a replica of the Wikipedia setup, starting from the DB dumps that Wikimedia did provide at the time. This road also proved fruitless: many configuration and performance details eluded us, and our Wikipedia test setup never worked properly.
The project was delayed and we were angry and hopeless when, at the post-PyCon 2010 asado in October 2010, one of the contributors suggested writing a small program to download the whole of Wikipedia, one page at a time, directly from his home. This was an option that we had half-jokingly suggested earlier in a meeting with educ.ar, so our reaction was one of true disbelief: a lot of things could go wrong with that idea. But as so often happens, code beats opinions. Two days later SAn had obtained, by this alternate road, a Spanish Wikipedia html dump, something that we had not managed to achieve in months of effort.
From there we worked a lot on updating our code to account for every detail that had changed in Wikipedia since 2008, and on optimizing the use of disc space, because the growth in pages and images had been exponential. We managed to send the final version to educ.ar by the end of June 2011. Later we worked to make sure that the disc covers and the disc itself carried a legend encouraging people to copy the disc freely, subject to the license restrictions of each part: the main content from Wikipedia, the classroom material made by educ.ar, and the free software from the CDPedia project.
I must now put the spotlight on these guys: Diego Mascialino, Facundo Batista and Santiago Picinnini, for the amount of time they put into the project during the final race to the 0.7 version that was used in the educ.ar disc, and also on all the contributors who, throughout the lifetime of CDPedia, helped with code and ideas. Having added my grain of sand to this project together with my friends from Python Argentina makes me very proud.
This is also a good time to start thinking about some of the things we should work on to make CDPedia 1.0 much better. Here are a few:
- it ought to work well on school servers with no Internet access, or where access is limited, since right now CDPedia only works well for one user at a time
- we should make local installation easier on educational laptops, such as those of the “Conectar Igualdad” plan in Argentina, or the OLPCs
- we should work on improving the current CDPedia so it can be useful for other Spanish-speaking countries
- we should work, probably with the Wikimedia Foundation, so it can be used for offline editions of Wikipedia in other languages.