March 26, 2013
Filling in music metadata
I don't listen to a lot of music, even though my hard drive is filled with gigs and gigs of the stuff. Actually, the fact that I have so much of it is part of the problem -- I can't just hit "Shuffle" on my music collection and enjoy. With Amy's music, kids' music, Christmas music, etc., in the mix, it's only a matter of time before I hit a dud. Even within the music I've collected, not every song on an album is a winner. (I don't have the heart to delete the other ones, though. It just feels wrong to only have part of an album.) And now that I have kids, the problem is further complicated by the fact that some of the lyrics in my music have what we around my house call "grown-up" words. So playing random songs when the kids are around is out. If I have my hands covered in finger paint, I'm not going to be able to get to the computer to skip tracks if Eminem decides he needs a rhyme for "truck."
These problems would be alleviated if I entered ratings and tags for all my music with something like iTunes. But that involves me sitting in front of a computer and doing this for every song. Thousands of songs. That's not going to happen. I've tried the "rate-as-you-go" approach, but I never remember to do it. So I wanted to find a way to automatically fill in the song information I was looking for.
Luckily, I did.
Finding the best music
One day, I was on the iTunes store, debating if I should buy an album, and I noticed that they listed the purchase popularity of each song.

The most popular ones happened to be the songs I liked best for that album. I had an idea: If I could add the iTunes popularity data to each of my songs, I would have a way of generating a playlist of popular songs. Sure, I like to think that I'm a special snowflake, and that no one else likes the same songs I do, but we're talking about music I've already purchased, so the other downloaders out there likely have taste similar to mine. It's a safe bet that I'll like the most popular songs on albums I own. And it's better than the non-existent ratings I have now.
So I set to work. My first obstacle was a big one: I couldn't find a way to programmatically get the iTunes Store popularity information. There didn't seem to be an API, and the web version of the store does not display popularity information. Boo. [Note to self: Insert snarky comment about Apple's locked-down ecosystem before posting this blog entry. Make it a good one.]
On to Plan B. Last.fm publishes the playcount for each of their songs, and their library is quite comprehensive. They even make the data available via API. Bingo.

Now I needed a way to look up a particular song on Last.fm. I tried searching by artist and song name, and this proved pretty accurate, but not perfect. My MP3 collection consists of songs purchased online, ripped from CD, or, uh, otherwise acquired, so the artist/song name info is not always accurate. So I had to fix that first.
The solution was a piece of free software called Picard, from MusicBrainz. Using Picard, I was able to correct all the metadata (artist, album, etc.) for my music. It's done semi-automatically, but I'm not going to lie: this took a while (about two evenings). But the payoff was two-fold: 1) All my basic metadata was now correct, and 2) Picard also saves a unique identifier for each song in the metadata. This identifier (they call it the "MBID") is supported by Last.fm for song lookups. So now I had a foolproof way to find a song's info on Last.fm.
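For the curious, the MBID lookup can be sketched roughly like this. This is not the original script, just a minimal sketch that assumes the Last.fm `track.getInfo` method called with an `mbid` parameter and a JSON response whose `track` object carries a `playcount` value. The function names and the API key are placeholders of my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "http://ws.audioscrobbler.com/2.0/"


def track_info_url(mbid, api_key):
    """Build a track.getInfo request URL for one MusicBrainz ID."""
    query = urlencode({
        "method": "track.getInfo",
        "mbid": mbid,
        "api_key": api_key,  # placeholder: use your own Last.fm API key
        "format": "json",
    })
    return f"{API_ROOT}?{query}"


def playcount_from_response(payload):
    """Return the playcount as an int, or None if the track was not found."""
    track = payload.get("track")
    if not track or "playcount" not in track:
        return None
    # Assumption: the playcount may arrive as a string, so convert it.
    return int(track["playcount"])


def fetch_playcount(mbid, api_key):
    """Look up one song on Last.fm (makes a network request)."""
    with urlopen(track_info_url(mbid, api_key), timeout=10) as response:
        return playcount_from_response(json.load(response))
```

Keeping the URL building and the response parsing separate from the network call makes it easy to try the parsing on a saved response before hammering the API with thousands of songs.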
Using Python, I wrote a script that scans my music folders, and looks up each song on Last.fm. Then, based on their playcount for that song relative to the other songs in the same directory, it assigns a rating, using a 1-5 scale. (I make the not-always-true assumption that one directory is an album, or at least a collection of music that can be compared with each other. There's some room for improvement there.)
- >= 2 standard deviations above mean playcount: 5 stars
- > 1 standard deviation above mean playcount: 4 stars
- within 1 standard deviation of the mean playcount: 3 stars
- > 1 standard deviation below mean playcount: 2 stars
- >= 2 standard deviations below mean playcount: 1 star
If there are fewer than 4 files in a directory, I don't do this auto-rating, as it's likely that it's just a few "greatest hits" for a performer, and not a full album.
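The rating rules above can be sketched in a few lines of Python. This is an illustration rather than the original script: the function names are made up, and the choice of the population standard deviation (`statistics.pstdev`) is an assumption, since the post does not say which flavor was used.

```python
from statistics import mean, pstdev

MIN_TRACKS = 4  # directories with fewer files are skipped


def stars_for(z):
    """Map a z-score (standard deviations from the mean) to 1-5 stars."""
    if z >= 2:
        return 5
    if z > 1:
        return 4
    if z <= -2:
        return 1
    if z < -1:
        return 2
    return 3


def rate_directory(playcounts):
    """Return one star rating per playcount, or None if there are too few tracks."""
    if len(playcounts) < MIN_TRACKS:
        return None
    average = mean(playcounts)
    spread = pstdev(playcounts)
    if spread == 0:
        # Every track has the same playcount, so nothing stands out.
        return [3] * len(playcounts)
    return [stars_for((count - average) / spread) for count in playcounts]
```

For example, a directory with nine tracks at 100 plays and one runaway hit at 1000 plays gives the hit 5 stars and everything else 3. One side effect of this approach: in a very small directory no track can get far enough from the mean to earn 5 stars.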
I then record this auto-rating in the "comment" section of the song's ID3 metadata. This does not overwrite any ratings I've already made, as iTunes (and most music players) don't store their ratings in the ID3 data. Now, using iTunes smart playlists, I can find good music using either ratings that I've made or ratings that the Last.fm listeners have made for me (giving priority to my ratings).
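Since the comment field ends up holding more than one piece of information, it helps to write each piece as a small searchable token. The sketch below shows one way to do that. The `KEY:value` format and the `AUTORATING` name are hypothetical, not what the original script writes, and the actual tag writing (done with mutagen) is left out.

```python
import re


def set_comment_token(comment, key, value):
    """Return the comment with KEY:value added or replaced, keeping other text."""
    token = f"{key}:{value}"
    pattern = re.compile(rf"\b{re.escape(key)}:\S*")
    if pattern.search(comment):
        # Replace the existing token so re-running the script does not pile up copies.
        return pattern.sub(lambda match: token, comment, count=1)
    return f"{comment} {token}".strip()
```

A smart playlist rule along the lines of "Comments contains AUTORATING:5" can then pick the songs out, and re-running the script updates the token in place instead of appending a second one.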
Finding "clean" music
Ratings are great, but I still can't play music around the kids, for those times that John Maher makes an artistic statement with some grown-up language. Just kidding, I don't listen to John Maher. But swear words are a problem. I'm not particularly uptight about this stuff, but the longer I can prevent my son from asking his grandmother for "some more f#$%ing Goldfish crackers," the better.
I took a two-pronged approach to this: Find a source to just tell me if something is explicit, and also find a source for the full song lyrics, to figure it out myself. I figure that between the two, I should get a good sense of whether the song is kid-safe.
"Explicit" flag
Last.fm does not provide information on explicit lyrics, so I was off to find an additional source. I found it with Amazon, which puts "[Explicit Lyrics]" next to the song name for any naughty songs. Last.fm provides Amazon URLs for each song (presumably with their affiliate ID in there), so I thought I was set. Unfortunately, they take the easy way out and just link to a glorified Amazon search page, so this would not prove 100% accurate. But after a few tests, it seemed reliable enough, so I added it to my script. So if the Amazon page for a song search contains "
I write this information to the comment field of the ID3 info for the song, since there is no existing field for this in the ID3 structure. I guess I'm on new ground here.
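The Amazon check boils down to looking for the marker text in the downloaded page. Here is a minimal sketch, assuming the page HTML has already been fetched as a string. The function name is made up, and matching case-insensitively is my own choice.

```python
EXPLICIT_MARKER = "[explicit lyrics]"


def page_flags_explicit(page_html):
    """Return True if the Amazon page text contains the explicit-lyrics marker."""
    # Lowercase both sides so "[Explicit Lyrics]" matches regardless of casing.
    return EXPLICIT_MARKER in page_html.lower()
```

Because the page is a search result rather than a single product, a hit can come from a different version of the same song, which is why this is only "reliable enough" and not perfect.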
All lyrics
For the lyrics, I was able to head back to my trusty friend MusicBrainz, as it links to a lyrics wiki site for many songs. (As a side note, I find it odd that MP3s don't already contain this info when I purchase them.) I download the lyrics via a simple HTML scrape, and save them to the ID3 info. Then I do a quick scan of the lyrics for some common explicit words, and add the result to the ID3 comment section.
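The lyrics scan can be sketched as a whole-word match against a list of words to watch for. This is an illustration, not the original code: the function name is made up and the two mild words in the list are stand-ins for whatever list you would actually use.

```python
import re

# Placeholder list: swap in the words you actually want to catch.
BLOCKLIST = {"damn", "hell"}


def find_flagged_words(lyrics, blocklist=BLOCKLIST):
    """Return the sorted blocklisted words that appear as whole words in the lyrics."""
    # Split into lowercase words so "Hello" is never mistaken for "hell".
    words = set(re.findall(r"[a-z']+", lyrics.lower()))
    return sorted(words & set(blocklist))
```

An empty result means the lyrics passed the scan. Matching whole words keeps innocent songs from being flagged just because a longer word happens to contain a shorter one.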
Again, using iTunes smart playlists, I can make a list of songs that are marked as non-explicit.
Bonus: Beats Per Minute
Lately I've been running on the treadmill in fits and spurts (both in frequency and running style). I really dislike running without music, so I wanted a playlist of good exercise music. Otherwise I just chant "I want to stop I want to stop I want to stop" over and over. I particularly like when the pace of the music matches my natural running pace. But finding songs that are "just right" for this is time-consuming.
While I was at it with all this metadata stuff, I came across bpmdatabase.com, which has a respectable database of Beats-Per-Minute info for songs. Since I'm already searching Last.fm and Amazon for song information, I added this site as well. They don't support MBID, so I just hit their search results page directly using artist and track name. If I get one result (and only one result) back, then I consider that a match.
I save this result to the BPM field in the ID3 information.
The bpmdatabase library is by no means comprehensive, so I'm considering supplementing this with software that can calculate a song's BPM. But I hear such tools are not terribly accurate, so I'm not in a rush to do that just yet. Also, bpmdatabase is tailored for music that DJs are likely to play, and I think that makes for good exercise music.
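The "one result and only one" rule is simple enough to show in code. In this sketch the search results are assumed to have already been scraped into a list of dictionaries. That shape, the `bpm` key, and the function name are all hypothetical.

```python
def single_match(results):
    """Return the only search result, or None if there are zero or several."""
    # Several hits are ambiguous (remixes, covers), so they are treated like no hit.
    return results[0] if len(results) == 1 else None
```

Being this strict means some songs go without a BPM, but a missing value is easier to live with than a wrong tempo in the middle of a run.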
The result
Here are some stats I got after running the script on my library of 3531 mp3 files. I now have:
- 3385 auto-rated songs: 674 songs with 4 or 5 star ratings, 187 songs with 1 or 2 star ratings
- 2546 "safe" songs / 135 explicit songs
- BPM info for 443 songs
Here's the final script. I make no claims about how well it's coded -- I ended up learning Python for this project, and it shows. It uses a library called "mutagen" for reading and writing the ID3 info. I had hoped to use PHP for this project, but was never able to find a way to write ID3 information using PHP. PHP solutions exist, but involve recompiling PHP, which I wanted to avoid. Python was on my todo list anyway.
The major downside to this project is that each time I get new music, I have to use Picard to get the MBID info, and then I have to remember to re-run my script. I can certainly automate the script piece, but I don't know about the Picard part.
Posted by Kevin at March 26, 2013 09:57 PM
Good grief. Just turn on the radio already. ;)
Posted by: Amy at April 12, 2013 03:18 PM
Amy, for that comment, your music may "accidentally" get erased. Well, except for Enrique Iglesias, because I'm not a monster.
Posted by: Kevin at April 12, 2013 04:23 PM
Your commentary, as well as the fact that you did all of this, is hilarious.
Posted by: Emily at November 6, 2013 08:16 PM
Dropbox no longer has the script :-(
Wanted the script so I could see where you stored the EXPLICIT info? Wanted to do something similar - to help avoid situations where the children are around. Personally I prefer the original versions for the music, provided explicit language isn't pointless and just there for shock.
Posted by: norkle at February 26, 2014 08:06 PM