[Python] How can I clean up Wikipedia's XML backup dump to create dictionaries of commonly used words for multiple languages?

This page summarizes the projects mentioned and recommended in the original post on /r/learnprogramming.

  • mwparserfromhell

    A Python parser for MediaWiki wikicode

  • Note that the page content inside the dump is not XML but wikitext. I found a Stack Overflow discussion about the same problem of extracting plain text from wikitext. Since you already have the dump, the most promising solution in Python seems to be running each page through mwparserfromhell. According to the top Stack Overflow answer you could use something like

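The idea of running each page through mwparserfromhell could be sketched roughly as follows. This is a minimal sketch, not the code from the Stack Overflow answer: the helper names (iter_page_texts, strip_wikitext, count_words, WORD_RE) are made up for illustration, and it assumes the dump follows the usual MediaWiki export layout, with each page's wikitext inside a text element. The only mwparserfromhell calls used are parse() and strip_code().

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Runs of Unicode letters (no digits, no underscores); a simple, assumed
# definition of "word" that will not suit every language.
WORD_RE = re.compile(r"[^\W\d_]+")


def iter_page_texts(source):
    """Yield the raw wikitext of each page in a MediaWiki XML export.

    `source` is a filename or a binary file object. The dump is streamed,
    so it never has to fit in memory as a whole.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags carry the export namespace, e.g. "{http://...}text";
        # compare the local name only.
        local = elem.tag.rsplit("}", 1)[-1]
        if local == "text" and elem.text:
            yield elem.text
        elif local == "page":
            elem.clear()  # drop the finished page's children to limit memory use


def strip_wikitext(wikitext):
    """Convert wikitext to plain text with mwparserfromhell."""
    import mwparserfromhell  # third-party: pip install mwparserfromhell

    return mwparserfromhell.parse(wikitext).strip_code()


def count_words(texts, strip=strip_wikitext):
    """Count lowercased words across an iterable of wikitext strings.

    `strip` is the function that turns wikitext into plain text; it is a
    parameter so another cleaner can be swapped in.
    """
    counts = Counter()
    for wikitext in texts:
        counts.update(word.lower() for word in WORD_RE.findall(strip(wikitext)))
    return counts
```

For a compressed dump, something like count_words(iter_page_texts(bz2.open(path, "rb"))).most_common(1000) would give a first word list, where path is the location of your dump file. Languages written without spaces between words would need a proper tokenizer in place of WORD_RE.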


Related posts

  • Processing Wikipedia Dumps With Python

    1 project | /r/programming | 18 May 2023
  • How can I clean up Wikipedia's XML backup dump to create dictionaries of commonly used words for multiple languages?

    2 projects | /r/learnpython | 10 Oct 2021
  • I spent the 2 weeks building a complex data parsing program for a data project and today I found out that such a library already exists.

    1 project | /r/learnprogramming | 14 May 2022
  • [UPDATE] Here's the transcript of the 1781 most-used German Nouns according to a 4.2 million word corpus research performed by Routledge

    1 project | /r/German | 9 Jul 2021
  • The Future of MySQL is PostgreSQL: an extension for the MySQL wire protocol

    1 project | news.ycombinator.com | 26 Apr 2024