The Music Industry

Introduction

Welcome to our project for the course 02805 Social graphs and interactions.

Our objective with this project is to analyze the music industry using data from Spotify and Genius. We want to visualize a network of musicians from all types of music, find hidden structure in the music industry, gain new insights into the different genres, and develop a tool to discover and explore new music. We want to see if we can relate the network structure to genres, lyrical sentiment, and other characteristics in the data. We will also look into lyrical themes to identify patterns for certain topics and genres, and see how they have evolved over the years.

We gathered the data for the analysis using the web APIs from Spotify and Genius [1][2]. You’re welcome to download the datasets using the button below and check out our Explainer Notebook, which shows the development process step-by-step. In addition, there is a brief description of the datasets below.

Download Datasets Explainer Notebook

Overview of the datasets

Artist dataset (CSV) - 5.6MB:
Description:
List of ~100k artists collected from Spotify using the ‘related artists’ Spotify API.
Most important variables:

Spotify artist ID
Artist name
Followers
Popularity score

Edges (CSV) - 44MB:
Description:
List of network edges created using the ‘related artists’ Spotify API. The ‘from’ variable is the artist whose page points to the ‘to’ artist’s page.
Variables:

Artist ID (from)
Artist ID (to)

Genres (JSON) - 6MB:
Description:
A JSON file that maps each artist ID to a list of genres associated with that artist. Gathered using the Spotify ‘related artists’ API.
Keys: Artist ID
Values: List of genres

Lyrics (CSV) - 137MB:
Description:
A table with all the songs we got from the ‘search’ and ‘song’ APIs from Genius. The lyrics were scraped with the ‘lyricsgenius’ Python package.
Most important variables:

Artist name
Song name
Song Lyrics
Release date
Lyrics (raw)
Lyrics (tokenized and lemmatized)
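
As a rough illustration of how the edges file described above can be turned into a graph, here is a minimal sketch that reads ‘from’/‘to’ pairs into an adjacency structure. The column names and the sample rows are assumptions made for the example; the real file may be laid out differently.

```python
import csv
import io

# Hypothetical sample in the shape described above; the real column
# names in the edges file may differ from 'from' and 'to'.
SAMPLE_EDGES = """from,to
artist_a,artist_b
artist_a,artist_c
artist_b,artist_a
"""


def read_edges(handle):
    """Map each artist ID to the set of artist IDs its page points to."""
    successors = {}
    for row in csv.DictReader(handle):
        successors.setdefault(row["from"], set()).add(row["to"])
        # Artists that only appear as a target should still be nodes.
        successors.setdefault(row["to"], set())
    return successors


graph = read_edges(io.StringIO(SAMPLE_EDGES))
```

To read an actual file, the same function can be given an open file handle instead of the in-memory sample.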

Network

First, let’s see what our data looks like. We have a dataset of 4,031 of the most popular and most followed artists on Spotify.

To visualize this data we’ll use a network graph where each artist in our subset is represented by a node and each edge between artists is created using a feature in Spotify called “related artists”. This connects an artist to a set of other artists which people might also like.

The image on the right contains exactly 4,031 nodes and 35,192 edges.

Community detection

To gain more insight into the music industry network, we apply community detection using the Louvain method. The algorithm works in the following way:

First, each node is assigned to its own community (i.e. each node is its own community). Then, for each node, the change in modularity is computed for removing the node from its community and moving it into the community of each of its neighbours. The node is placed in the community that gives the largest increase in modularity, or stays where it is if no move improves it. This is repeated until no move improves modularity.
Next, each community is treated as a single node, and the connections between the communities’ members are used as weighted edges between the new nodes.

The algorithm then starts over, using the aggregated network from the previous step as the new network, and stops when modularity no longer improves. It results in the following communities:
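
The first phase described above (moving single nodes between communities) can be sketched in a few lines. This is an illustration only, not the implementation we used: it assumes an undirected, unweighted graph, recomputes modularity from scratch for every trial move (far too slow for a real network), and leaves out the aggregation phase. The function names and the toy graph are made up for the example.

```python
def modularity(adj, community):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    inside, degree = {}, {}
    for node, nbrs in adj.items():
        c = community[node]
        degree[c] = degree.get(c, 0) + len(nbrs)
        # Each internal edge is seen from both ends, so count halves.
        inside[c] = inside.get(c, 0) + sum(0.5 for n in nbrs if community[n] == c)
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def local_moving(adj):
    """Phase one of Louvain: greedily move nodes while modularity improves."""
    community = {node: node for node in adj}  # every node starts alone
    improved = True
    while improved:
        improved = False
        for node in adj:
            best_c, best_q = community[node], modularity(adj, community)
            for nbr in sorted(adj[node]):
                trial = dict(community)
                trial[node] = community[nbr]
                q = modularity(adj, trial)
                if q > best_q + 1e-12:  # only accept a real improvement
                    best_c, best_q = community[nbr], q
            if best_c != community[node]:
                community[node] = best_c
                improved = True
    return community


# Toy graph: two triangles joined by a single edge (c-d).
toy = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
partition = local_moving(toy)
```

On this toy graph the two triangles end up as two separate communities, which is the kind of grouping we are after in the artist network.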

We find 31 communities in the giant connected component of the network. The bar plot shows an apparent theme within each community.

Ed Sheeran, Ariana Grande and Rihanna are all pop musicians.
Drake, Migos and Eminem are rappers.
Queen, Guns N’ Roses and the Beatles are old-time rock and rollers.
The Weeknd, Nicki Minaj and Kendrick Lamar are known hip hop artists.
Zac Efron and Hugh Jackman have played in musical movies.
Stormzy, KSI and Jake Paul are all YouTubers.

To analyze the communities even further, the artists’ nodes in the network are colored by the community they belong to. For each node, you can explore the artist’s name, Spotify’s popularity score, the number of followers, and other artists the musician is linked to. The size of each node is determined by the number of followers the artist has. That way, the largest nodes are the artists that most people are familiar with. The biggest ones are Ed Sheeran, Drake, Eminem, Justin Bieber, Rihanna and Ariana Grande.

The communities show trends, as many of the artists in the same community belong to the same genre. Several separate networks can be seen in the figure below. The smaller connected components, at the bottom right of the figure, consist of more obscure, diverse genres from all around the world such as Indian Bollywood artists, Afropop, and Japanese rock. The largest connected component contains more mainstream genres like pop, rock, rap, hip hop and more.
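
For readers who want to reproduce the split into connected components, here is a minimal sketch using a breadth-first search. It assumes the edges are treated as undirected and that the graph is stored as a dictionary of neighbour lists; the toy graph and node names are invented for the example.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    component.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy graph: one component of three nodes and one of two nodes.
toy = {
    "pop_1": ["pop_2"], "pop_2": ["pop_1", "pop_3"], "pop_3": ["pop_2"],
    "jrock_1": ["jrock_2"], "jrock_2": ["jrock_1"],
}
components = connected_components(toy)
giant = components[0]
```

The first element of the result plays the role of the giant connected component; the remaining ones correspond to the smaller, separate networks.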

Feel free to press the “Explore the network” button to find further information on your favorite artists.

Explore the network!

The pink and orange clusters include a mix of similar genres such as 2010-2020 pop, hip hop, dance pop, electro house, and R&B, with artists such as Beyoncé, Miley Cyrus and Calvin Harris. The bright yellow community looks like an emo/punk/rock cluster with artists like Marilyn Manson and Linkin Park. The gold-brown community includes the old-school rock and rollers: Queen, Pink Floyd and the Beatles. The blue community consists of rappers such as Drake, Eminem and Kanye West. The turquoise community contains Latin artists, with Shakira as the most popular musician of that cluster.

Giant connected component

Let’s take a closer look at the largest connected component. This time we will generate the network with a different approach: we spatialize it using the ForceAtlas2 layout, with strong gravity so that we can view all the nodes at the same time. The nodes are colored according to their community. Due to the layout, adjacent communities are more related than communities that are far apart. Additionally, the position of a node within its community reflects how related it is to the surrounding communities.

Explore the network!

From the figure above, we can see that Queen and Elvis Presley are in the same community. Queen is closest to the Guns N’ Roses-AC/DC community, while Elvis Presley is closest to the Ella Fitzgerald-Miles Davis community. These different positions in the layout are consistent with the eras in which the artists were active.

Betweenness centrality

Betweenness centrality measures the number of times a node lies on the shortest path between two other nodes. Nodes with high betweenness therefore play a vital role in connecting the whole network. Because they dominate the shortest paths between other nodes, removing them would break or lengthen many of the links between those nodes.
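
To make the definition concrete, here is a small sketch of Brandes’ algorithm for betweenness centrality. It assumes an undirected, unweighted graph stored as a dictionary of neighbour lists and returns unnormalized scores; it is meant as an illustration and is not necessarily how the values on this page were computed.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness centrality (undirected, unweighted graph)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and remembering each node's predecessors on those paths.
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the nodes farthest from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair of endpoints was counted twice.
    return {v: b / 2 for v, b in score.items()}


# Toy path graph a - b - c - d: the two middle nodes act as bridges.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness(toy)
```

In the toy path graph the two middle nodes each lie on two shortest paths between other nodes, while the end nodes lie on none.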

Now the node size is changed according to the betweenness centrality to identify which artists are most important when it comes to connecting all the artists in the network. The nodes that become the biggest in the network have the highest betweenness centrality, and therefore, influence the flow within the artist network.

The figure shows that the artist that is most important in acting as a bridge in the network is Pitbull. This could indicate that Pitbull is related to artists from several other communities (various genres). The relation between artists represents what people like. Therefore, if people like Pitbull, they might also like the artists he ties together from different clusters.

This shows that Pitbull truly can call himself Mr. Worldwide, which, in fact, he does on every occasion.

Sentiment

Artists use their song lyrics to convey their message, feelings, and emotions. Songs are all about making the listeners feel what the artist feels, to enhance the way you experience music.

It is interesting to analyze how emotions are shown in song lyrics and detect which songs and artists convey more positive or negative emotions. We also explore the genres to detect whether some tend to be darker/sadder than others, e.g., is there an apparent difference in happiness between rap and pop?

The LabMT wordlist is used to classify the tokenized words from 12,000 English lyrics by 2,142 artists. The list includes common words and their average happiness score that can be applied to songs, artists, and genres.

The artists considered are only those who have 3 songs or more in the analyzed data. For many of the uncommon genres, the dataset only contains a few songs which are not considered sufficient to describe the whole genre. Therefore, the only genres examined are those that consist of 100 songs or more.

The happiness average for each musician is found by calculating the mean of the artist's songs’ happiness averages. Similarly, the genre happiness average is the mean of each genre’s artists’ happiness averages.
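
The averaging steps above can be sketched as follows. The word scores are made up and are not the real LabMT values, the data layout is an assumption, and the genre-size threshold is left out to keep the example short.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical word scores for illustration only (not the LabMT list).
TOY_SCORES = {"love": 8.0, "sun": 7.0, "rain": 5.0, "cry": 2.0}


def song_happiness(tokens, scores):
    """Mean score of the tokens found in the word list, or None if none match."""
    hits = [scores[t] for t in tokens if t in scores]
    return mean(hits) if hits else None


def artist_happiness(songs, scores, min_songs=3):
    """Mean of song averages per artist, keeping artists with enough songs."""
    per_artist = defaultdict(list)
    for artist, tokens in songs:
        value = song_happiness(tokens, scores)
        if value is not None:
            per_artist[artist].append(value)
    return {a: mean(v) for a, v in per_artist.items() if len(v) >= min_songs}


def genre_happiness(artist_scores, artist_genres):
    """Mean of artist averages per genre."""
    per_genre = defaultdict(list)
    for artist, value in artist_scores.items():
        for genre in artist_genres.get(artist, []):
            per_genre[genre].append(value)
    return {g: mean(v) for g, v in per_genre.items()}


toy_songs = [
    ("A", ["love", "sun"]),   # song average 7.5
    ("A", ["love", "the"]),   # "the" is not in the list, average 8.0
    ("A", ["rain", "cry"]),   # song average 3.5
    ("B", ["cry"]),
    ("B", ["sun"]),           # B has only two songs and is dropped
]
artists = artist_happiness(toy_songs, TOY_SCORES)
genres = genre_happiness(artists, {"A": ["pop", "soul"], "B": ["pop"]})
```

Note that averaging averages gives every song the same weight within an artist, and every artist the same weight within a genre, regardless of song length or catalogue size.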

Sentiment Network

The network below is colored according to artists’ average happiness score. The node gradient is based on the sentiment where blue colors show artists with a tendency to include sadder words in their lyrics while the red ones embrace happier words.

The two contrasting clusters in the top left section are worth a closer look. Zooming in shows that one cluster includes a lot of blue rock/metal artists like Linkin Park and Slipknot, while the red cluster contains old jazz/funk artists like Bill Withers and Louis Armstrong.

The network displays a lot of mixed clusters and somewhat “neutral” artists. To further understand why so many artists are “neutral”, let’s look at the distribution of the songs’ average happiness score.

The lyrics’ average happiness score approximately follows a normal distribution with a low standard deviation, meaning that most songs have a similar average happiness score. We are interested in looking into the most extreme ones to see what differentiates them from the others. Which songs, artists and genres are the happiest/saddest?

Sentiment distribution for the 10 most common genres

Themes between genres can be very dissimilar. How does lyrical content differ across genres?

The KDE plot below displays the relative sentiment distribution within each genre, by considering the average happiness score of every artist belonging to the genre. The area for each genre is equal to 1.
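
As a sketch of what a KDE does, the snippet below builds a Gaussian kernel density estimate by hand and checks numerically that the area under the curve is close to 1. The sample values and the bandwidth are made up for the example, and the plot on this page may use a different bandwidth rule.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function that averages one Gaussian bump per sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical artist happiness averages for a single genre.
toy_averages = [5.2, 5.4, 5.5, 5.9]
density = gaussian_kde(toy_averages, bandwidth=0.2)

# Riemann sum over a range wide enough to cover practically all the mass.
step = 0.01
area = sum(density(3 + i * step) for i in range(501)) * step
```

Because every genre’s curve is normalized to the same area, the plot compares the shape and position of the distributions, not the number of artists in each genre.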

Play around with the plot and compare the sentiments between your favorite genres!

As you might have observed from the plot, contemporary country and alternative metal are very different.

Top 10 Lists

The table includes the 10 happiest/saddest songs, artists and genres. Feel free to skim through it and check whether your favorite songs, artists or genres express intense emotions. You can also check out our observations on the right.

Lyrics sentiment over time

As seen in the table above, many of the happiest songs and artists are old-timers that were famous back in the 50s, 60s or 70s. The music industry is constantly changing and evolving, and therefore, it is interesting to see the changes in music sentiments over the years.

Have a look at the figures below. On the left, each blue dot represents a song, placed by its release year and its average happiness score. The red line shows the yearly average happiness score. The graph on the right shows the average happiness score of each decade.

The graphs show a slight downward trend. Two explanations come to mind. One is that music has slowly become sadder over time. The other is sample bias: because we chose artists based on their popularity and number of followers today, it might just be that people who listen to older musicians prefer the happier ones to the sadder ones.

At first glance, it seems that newer music reaches further into both extremes than older music, but looking at the standard deviation we see that this is likely not the case. We may simply be seeing more extreme songs because we have more new songs in our dataset than old ones.
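
The decade comparison can be sketched as a simple grouping step that reports the mean, the standard deviation and the number of songs per decade. The data points below are invented for the example, and the population standard deviation is used as one possible choice.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (release year, average happiness) pairs, not real data.
toy_songs = [(1965, 5.8), (1967, 5.6), (2011, 5.3), (2015, 5.1), (2019, 5.5)]


def by_decade(songs):
    """Return {decade: (mean, population std, number of songs)}."""
    groups = defaultdict(list)
    for year, score in songs:
        groups[year // 10 * 10].append(score)
    return {d: (mean(s), pstdev(s), len(s)) for d, s in sorted(groups.items())}


summary = by_decade(toy_songs)
```

Reporting the number of songs next to the spread makes it easier to tell whether a decade looks more extreme because its songs really vary more, or only because it contains more songs.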

Two possible conclusions remain:

On average, it’s the happier songs that live on.
Music has seen a slow and steady decline towards sadder and sadder lyrics.

Text Analysis

Before we dive into the text analysis, let’s quickly stop to think about a metric called TF-IDF. TF-IDF stands for term frequency-inverse document frequency and is a metric for the relative importance of a word in a given song. It considers how many times the word appears in the given song (term frequency) as well as how many songs the word appears in overall (document frequency). The more songs a word appears in, the less important it is; the more often it appears in the given song, the more important it is.

For this section, we first computed the TF-IDF for each word in every single one of our more than 10,000 songs. Then, because we don’t want longer songs to count more than shorter songs, we normalize the values so that each song has a combined weight of 1.
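
A minimal sketch of this computation is shown below. It assumes the plain log(N / df) form of the inverse document frequency and assumes that a “combined weight of 1” means the weights of each song sum to 1; other TF-IDF variants and normalizations exist, and the toy lyrics are invented.

```python
import math
from collections import Counter


def tfidf(songs):
    """Return one {word: weight} dict per song, each summing to 1."""
    n = len(songs)
    # Document frequency: in how many songs does each word appear?
    df = Counter(word for tokens in songs for word in set(tokens))
    weights = []
    for tokens in songs:
        tf = Counter(tokens)  # term frequency within this song
        raw = {w: count * math.log(n / df[w]) for w, count in tf.items()}
        total = sum(raw.values())
        # Normalize so that long songs do not outweigh short ones.
        weights.append({w: v / total for w, v in raw.items()} if total else raw)
    return weights


toy_songs = [
    ["love", "love", "home"],
    ["love", "dance"],
    ["home", "road", "road"],
]
weights = tfidf(toy_songs)
```

In the second toy song, ‘dance’ gets a higher weight than ‘love’ even though both appear once, because ‘love’ also appears in another song.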

Having seen the difference between some genres in the sentiment analysis, we want to show what’s actually behind the difference in these numbers. For this, we calculate the average TF-IDF value of each word for each genre and then create word clouds where the size of each word is proportional to its weight. Let’s first look at the two extremes of the sentiment spectrum: contemporary Christian music (ccm) and underground hip hop.

The word clouds definitely tell a story. The ccm word cloud is filled with religious words and the underground hip hop word cloud is filled with curse words and derogatory terms. However, most genres are probably not as extreme as these two cases. Let’s look at 10 dissimilar genres from the top 25 most popular ones and try to see if the themes make sense.

Topic extraction

Non-negative matrix factorization (NMF) is a method designed to extract patterns or ‘parts’ from high-dimensional data, such as the TF-IDF data we have here. These patterns should present themselves as exaggerated topics of songs. Examples of these might be a 'love song' topic or a 'nostalgic' topic. Such topics would be described as weights for each word. The higher the weights, the more likely the word is to be present in the topic. For the 'love song' topic, we might expect to see high weights for words such as 'love', 'miss', and 'need' since these are common in such songs.
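
To give a feel for how NMF works, here is a small sketch using the classic multiplicative update rules on a tiny songs-by-words matrix. It is written in plain Python for readability, which would be far too slow for the real data, and it is not the implementation we used; the matrix values, the number of topics and the function names are assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius distance between V and the product W H."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw))


def nmf(v, k, steps=200, seed=0, eps=1e-9):
    """Factor V (songs x words) into W (songs x topics) and H (topics x words)."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(k)]
    for _ in range(steps):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(len(h[0]))]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(len(w))]
    return w, h


# Toy TF-IDF matrix: songs 1-2 use words 1-2, songs 3-4 use words 3-4.
toy = [
    [0.5, 0.5, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.3, 0.7],
]
w0, h0 = nmf(toy, k=2, steps=0)   # random starting point
w, h = nmf(toy, k=2, steps=200)   # same start, after 200 updates
error_before = squared_error(toy, w0, h0)
error_after = squared_error(toy, w, h)
```

Each row of H is one topic, given as a non-negative weight per word, and each row of W says how strongly a song uses each topic. Since the updates only multiply by non-negative ratios, the weights never become negative.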

We're going to try to extract some themes from the TF-IDF data frame, then select a few of the resulting topics and create word clouds to see if they seem reasonable or familiar.

Here we see three very familiar and common topics in songs laid out as word clouds. The topics are ‘Home’, ‘Dancing’ and ‘Love’.

Two other topics we noticed were the only two that contained the words ‘Boy’ and ‘Girl’. We were interested in seeing what the contents of these two topics might be, so here you can see them as word clouds.

Looking at the two extracted topics, we see an unsettling reality about the message being sent by the music industry regarding gender roles. The words associated with girls indicate objectification and demeaning viewpoints regarding women. We see a stark contrast when we look at the word cloud where ‘boy’ appears, which shows a narrative of adventure with words like ‘special’, ‘jungle’, ‘jewel’ and ‘horses’. These results didn’t really surprise us, since this is a known problem in the world and especially in the music industry, but it was still a bit shocking to see it laid out so clearly and emerging from data rather than opinions and articles.