Podcast Metadata Scraping
Turns out scraping metadata from iTunes-compatible podcast RSS feeds is extremely easy. I couldn't remember which episode of the Lebanese Politics Podcast had the hosts talking about the Beirut trash protests, so I went to iTunes to try and find out. However, the search in iTunes is shockingly bad. Plus I wanted to run a grep through the thing for my own purposes anyways, so I decided to scrape this information down.
The code is at https://github.com/peixian/podcast-description-search and distributed under an MIT license.
It takes in an iTunes URL for the podcast and can either run a simple query against the podcast descriptions or throw back a CSV:
$ ./scrape.py -h
usage: search.py [-h] [-q QUERY] [-o OUT] itunes_url

Takes in an itunes podcast id, searches for a specific string, also dumps all
the shows to a csv.

positional arguments:
  itunes_url            URL from itunes for podcast

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Specific string to search for. This is quite dumb so
                        the csv with a more complex engine might be better
  -o OUT, --out OUT     Path to dump the results to a csv
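For a rough idea of what the query and CSV steps involve, here is a minimal sketch using only the Python standard library. It is not the code from the repository: the function names (parse_episodes, search_episodes, episodes_to_csv) and the choice of columns (title, pubDate, description) are my own assumptions, and it expects the RSS XML to already be in hand as a string.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Columns kept for each episode. This selection is an assumption for the sketch.
FIELDS = ["title", "pubDate", "description"]


def parse_episodes(rss_xml):
    """Return one dict per <item> element found in an RSS document string."""
    root = ET.fromstring(rss_xml)
    episodes = []
    for item in root.iter("item"):
        # findtext returns None when a tag is missing, so fall back to "".
        episodes.append({name: (item.findtext(name) or "").strip() for name in FIELDS})
    return episodes


def search_episodes(episodes, query):
    """Case-insensitive substring match on title and description (deliberately dumb)."""
    needle = query.lower()
    return [
        ep for ep in episodes
        if needle in ep["title"].lower() or needle in ep["description"].lower()
    ]


def episodes_to_csv(episodes):
    """Serialize the episodes to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(episodes)
    return buf.getvalue()
```

Once the shows are in a CSV, anything smarter than a substring match (grep, a spreadsheet, a proper search index) can take over from there.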
Since iTunes-compatible RSS feeds are standardized, it turns out the public API left up at https://itunes.apple.com/lookup takes a single ID URL param, which is an iTunes globally unique ID. You can do a GET on this endpoint without any auth, and it'll return where the original RSS feed is hosted. In my case, it was hosted at SoundCloud, which then provides an open endpoint for grabbing the feed in the iTunes standardized RSS format.
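The lookup step can be sketched roughly like this, again with just the standard library. A few assumptions to flag: the helper names are hypothetical, the numeric ID is assumed to appear in the iTunes URL as a trailing id<digits> segment, the query parameter is spelled id here, and the JSON response is assumed to carry the feed location in a feedUrl field on the first entry of results.

```python
import json
import re
import urllib.parse
import urllib.request

LOOKUP_URL = "https://itunes.apple.com/lookup"


def extract_podcast_id(itunes_url):
    """Pull the numeric ID out of a URL shaped like .../id<digits> (assumed shape)."""
    match = re.search(r"/id(\d+)", itunes_url)
    if match is None:
        raise ValueError(f"no podcast ID found in {itunes_url!r}")
    return match.group(1)


def feed_url_from_lookup(payload):
    """Read the RSS feed location out of a decoded lookup response.

    Assumes the feed URL lives at payload["results"][0]["feedUrl"].
    """
    results = payload.get("results", [])
    if not results or "feedUrl" not in results[0]:
        raise LookupError("lookup response did not include a feed URL")
    return results[0]["feedUrl"]


def lookup_feed_url(podcast_id, timeout=10):
    """Do an unauthenticated GET against the lookup endpoint and return the feed URL."""
    query = urllib.parse.urlencode({"id": podcast_id})
    with urllib.request.urlopen(f"{LOOKUP_URL}?{query}", timeout=timeout) as resp:
        payload = json.load(resp)
    return feed_url_from_lookup(payload)
```

Keeping the network call in its own function means the ID extraction and response handling can be checked offline against a saved response.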