Top 6 Python git-scraping Projects
nepstonks
An automated bot that scrapes the latest upcoming issues, news, and investment opportunities announced in Nepal and sends them to a Telegram channel.
I have done similar work using the GitHub APIs before. I recommend using their GraphQL explorer to develop your queries interactively. You may need to fall back on the REST API instead of the GraphQL one for certain stats.
https://docs.github.com/en/graphql/overview/explorer
You can also refer to my code here, which may already collect some of the statistics you're interested in.
https://github.com/jstrieb/github-stats/blob/master/github_s...
I predict the most annoying part of this project will be dealing with authentication. There are a handful of ways to do it, and the permissions can be finicky depending on what data you are fetching.
Best of luck!
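As a rough illustration of the GraphQL route mentioned above, here is a minimal sketch that builds an authenticated request against GitHub's GraphQL endpoint using only the standard library. The query itself, the `build_request` and `fetch_stats` helper names, and the idea of passing a personal access token are assumptions for illustration, not taken from the linked code; which token scopes you need depends on the data you fetch.

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query (an assumption, not from the linked project):
# repository names and star counts for a single user.
QUERY = """
query($login: String!) {
  user(login: $login) {
    repositories(first: 10) {
      totalCount
      nodes { name stargazerCount }
    }
  }
}
"""


def build_request(token: str, login: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the GraphQL endpoint."""
    payload = json.dumps({"query": QUERY, "variables": {"login": login}})
    return urllib.request.Request(
        GRAPHQL_URL,
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_stats(token: str, login: str) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(token, login)) as response:
        return json.load(response)


# Example call, assuming a token is stored in a variable named `token`:
# print(fetch_stats(token, "octocat"))
```

Keeping request construction separate from sending makes it easier to check the headers and payload while you sort out authentication.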
Project mention: Git scraping: track changes over time by scraping to a Git repository | news.ycombinator.com | 2023-08-10
Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.
I think it's fine to use the term "scraping" to refer to downloading a JSON file.
These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.
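The JSON-based approach described above might look roughly like the sketch below: download a JSON document, write it in a normalized form, and commit only when the content actually changed. The helper names (`fetch_json`, `write_if_changed`, `commit`, `scrape`) are hypothetical, and the code assumes it runs inside an existing Git repository; real Git scrapers typically run on a schedule, for example from a CI job.

```python
import json
import subprocess
import urllib.request
from pathlib import Path


def fetch_json(url: str):
    """Download and decode a JSON document (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def write_if_changed(path: Path, data) -> bool:
    """Write data as normalized JSON; return True only if the file changed."""
    # Sorted keys and fixed indentation keep diffs limited to real changes.
    text = json.dumps(data, indent=2, sort_keys=True) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == text:
        return False
    path.write_text(text, encoding="utf-8")
    return True


def commit(path: Path, message: str) -> None:
    """Stage and commit one file; assumes the working directory is a Git repo."""
    subprocess.run(["git", "add", str(path)], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)


def scrape(url: str, path: Path) -> None:
    """One scraper run: fetch, save, and commit only when something changed."""
    if write_if_changed(path, fetch_json(url)):
        commit(path, f"Update {path.name}")
```

Normalizing the output before writing is a design choice: without it, a server that reorders keys between requests would produce commits that record no real change.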
I do run Git scrapers that process HTML as well. A couple of examples:
scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.
scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/
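For the HTML case, the conversion step could be sketched with the standard library's `html.parser`, as below. This is not how either project above does it (the second uses shot-scraper); the `TableRows` class, the `table_to_records` helper, and the sample markup are all made up for illustration, and the sketch assumes a single simple table whose first row holds the column names.

```python
import json
from html.parser import HTMLParser

# Made-up sample markup, standing in for a scraped page.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>Time</th><th>Type</th></tr>"
    "<tr><td>10:02</td><td>Medical</td></tr>"
    "<tr><td>10:15</td><td>Alarm</td></tr>"
    "</table>"
)


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html: str) -> list:
    """Turn table rows into dicts keyed by the first (header) row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


print(json.dumps(table_to_records(SAMPLE_HTML), indent=2))
```

Storing both the raw HTML and the derived JSON, as the first example above does, means the conversion can be rerun later if the parsing logic changes.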
Index
What are some of the best open-source git-scraping projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | github-stats | 2,750 |
2 | spotify-playlist-archive | 383 |
3 | csv-diff | 275 |
4 | help-scraper | 41 |
5 | nepstonks | 24 |
6 | scrape-san-mateo-fire-dispatch | 1 |