DATA
Collection of song data from the Top 100 Songs appearing in Billboard magazine each year from 1965 to 2023.
About the Data
The 1965-2015 portion of the data was aggregated by walkerkq. They gathered data across 50 years of the Top 100 songs featured in Billboard Hot 100 Lists.
The 2016-2023 portion of the data was gathered by our team, using a similar procedure to ensure proper continuity of the dataset. In addition, we added several variables of interest to our dataset through web scraping, cleaning, and standardization of our data. You can view the updated data here.
METHODOLOGY
Cleaning
We began data preparation in RStudio. Our first step was to import the dataset and aggregate it into basic visualizations, which gave us a brief overview of the variables we had. This exploratory data analysis helped our group notice trends and narrow the topic of our research. Having noted the potential of categorizing our songs into genres, we decided to incorporate genre as a variable in our study.
To categorize each observation by genre, we needed to add a column titled “genre” to our dataset. We developed several Python scripts that connect to the Spotify API through the Spotify library and link each artist to a specific genre. The script reads the imported CSV dataset line by line, runs a Spotify search for each entry, identifies the artist by Spotify Artist ID, and retrieves the genres associated with that artist. This expanded dataset facilitates an in-depth investigation of genres, encompassing the range of artists’ styles, and provides fresh insights. We then exported the dataset as a CSV file for the data cleaning stage. Our dataset serves a dual purpose: it sets a basis for further analysis and data exchange, and it serves as documentation for future research on this topic.
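The genre lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes the third-party spotipy package as the Python client, a column named "Artist", a search by artist name rather than by a stored Artist ID, and "NA" as the marker for artists with no genres. The sample rows and stand-in genres are made up so that the sketch runs without network access.

```python
import csv
import io
from typing import Callable, Dict, List


def make_spotify_lookup() -> Callable[[str], List[str]]:
    """Build a function that asks Spotify for an artist's genres.

    Assumptions: the third-party `spotipy` package is installed and the
    SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET environment variables are set.
    """
    import spotipy  # imported lazily so the rest of the sketch runs offline
    from spotipy.oauth2 import SpotifyClientCredentials

    client = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

    def lookup(artist_name: str) -> List[str]:
        # Search for the best-matching artist and return its genre labels.
        result = client.search(q=f"artist:{artist_name}", type="artist", limit=1)
        items = result["artists"]["items"]
        return items[0]["genres"] if items else []

    return lookup


def add_genre_column(rows, lookup):
    """Return copies of the rows with a "genre" column added.

    Each distinct artist is looked up only once; results are cached.
    """
    cache: Dict[str, List[str]] = {}
    labelled = []
    for row in rows:
        artist = row["Artist"]  # hypothetical column name
        if artist not in cache:
            cache[artist] = lookup(artist)
        genres = cache[artist]
        labelled.append({**row, "genre": "; ".join(genres) if genres else "NA"})
    return labelled


# Offline demonstration with made-up rows and a stand-in lookup.
SAMPLE_CSV = (
    "Rank,Song,Artist,Year\n"
    "1,Song A,Artist X,2016\n"
    "2,Song B,Artist X,2016\n"
    "3,Song C,Artist Y,2016\n"
)
FAKE_GENRES = {"Artist X": ["dance pop", "pop"]}
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
labelled_rows = add_genre_column(rows, lambda name: FAKE_GENRES.get(name, []))
```

With real credentials, passing make_spotify_lookup() in place of the lambda would query Spotify instead. Caching by artist keeps the number of API requests close to the number of distinct artists rather than the number of rows.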
During data cleaning, we noticed inconsistencies in how certain artists were labeled in the original raw dataset. For example, bands titled “____ and ____”, such as “Adam and the Ants”, appear under several spellings within the dataset, including “Adam and the Ants” and “Adams & the Ants”. If we did not take this into account, our visualizations would treat each spelling as a different band. To make our data more consistent, we used regular expressions in R to find and capture these patterns for each artist or band and rewrite them in one consistent format: “Adam and the Ants”. Additionally, an artist credited together with a featured artist was originally treated as a separate entity. For example, “Zedd ft. Foxes” would be considered a separate artist from “Zedd”. To make this case consistent, we used regular expressions to credit each song to its main creator and not to the featured artist. For missing lyrics within our data, we replaced the NA values with a placeholder string indicating that lyrics were not found. We did not remove rows that contained missing data because we did not want to risk skewing the distribution of our dataset.
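The artist-name cleanup can be illustrated with a short sketch. The project did this step with regular expressions in R; the version below uses Python's re module only to keep the examples in one language. The patterns, the alias table, and the placeholder text are all assumptions chosen for illustration, not the rules actually applied.

```python
import re

# Matches a trailing featured-artist credit such as " ft. Foxes" or " feat X".
FEATURE_RE = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+.*$", re.IGNORECASE)
AMPERSAND_RE = re.compile(r"\s*&\s*")
SPACE_RE = re.compile(r"\s+")

# Hypothetical manual corrections for misspellings that patterns cannot catch.
ALIASES = {"adams and the ants": "Adam and the Ants"}

# Hypothetical placeholder text for songs without lyrics.
MISSING_LYRICS = "lyrics not found"


def normalize_artist(raw: str) -> str:
    """Reduce an artist credit to one consistent spelling of the main artist."""
    name = FEATURE_RE.sub("", raw.strip())   # drop the featured artist
    name = AMPERSAND_RE.sub(" and ", name)   # "&" becomes "and"
    name = SPACE_RE.sub(" ", name).strip()   # collapse repeated spaces
    return ALIASES.get(name.lower(), name)


def fill_missing_lyrics(lyrics):
    """Keep the row but replace missing lyrics with a placeholder string."""
    if lyrics is None or lyrics.strip() in ("", "NA"):
        return MISSING_LYRICS
    return lyrics
```

For instance, normalize_artist("Adams & the Ants") and normalize_artist("Adam and the Ants") both return "Adam and the Ants", and normalize_artist("Zedd ft. Foxes") returns "Zedd".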
Visualizing
Using ggplot2 in R as well as Tableau, we visualized the trends in our data and examined how genres change over time. Line charts and bar graphs showed how genres shifted in popularity, reflecting changes in musical tastes. To perform text analysis for visualizations such as word clouds, we also examined the lyrics column in depth. We used the tidytext library in R to break the lyrics down into single words or n-grams through a process known as tokenization. A key part of this technique is removing common and custom stop words so that the analysis focuses on the meaningful lyrical content. Checking word frequencies helps us identify the words most commonly used within each genre. We presented our findings as word clouds to facilitate quick identification of each genre’s key terms. This integrated approach combines R’s capabilities for processing and visualizing data with access to Spotify’s vast music database through Python’s Spotify library.
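The tokenization and word-frequency step can be sketched as below. The project used the tidytext library in R; this Python version is only an illustration of the same idea, and the tiny stop-word set, the token pattern, and the function names are assumptions.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word set; a real analysis would use a much longer list.
STOP_WORDS = {"the", "a", "and", "i", "you", "to", "of", "in", "my", "me", "it"}

# A token is a run of letters, optionally followed by an apostrophe part.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Split lyrics into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one n-gram string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_words_by_genre(songs, k=3):
    """Count non-stop words per genre and return the k most frequent.

    `songs` is an iterable of (genre, lyrics) pairs.
    """
    counts = defaultdict(Counter)
    for genre, lyrics in songs:
        words = [w for w in tokenize(lyrics) if w not in STOP_WORDS]
        counts[genre].update(words)
    return {genre: counter.most_common(k) for genre, counter in counts.items()}
```

The per-genre counts returned by top_words_by_genre are the kind of input a word cloud needs: each word paired with its frequency.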
By combining these approaches, our methodology provides a comprehensive examination of musical style, lyrics and artists’ histories, including a thorough exploration of the dataset being studied.
DATA CRITIQUE
What’s in our data set
Our completed data set includes information about every song from the Billboard Year-End Top 100 from 1965 to 2023. For each song, our data has the song’s ranking on the respective year’s Top 100 chart, the song title, the song’s artist, the year the song appeared on the Billboard chart, the song’s lyrics, and the song’s genre. Our data set has 5,880 observations in total.
What our data can and can’t illustrate
Our data can illustrate many aspects of trends in music over time, given that it spans roughly six decades. Using text analysis, our data can illustrate lyrical trends such as the most frequently used words by genre, decade, or artist, the change in lyrics over time, and the change in song length over time, measured by the number of words per song. In addition, we established the popularity of each genre across decades and explored how the popularity of different kinds of music has trended.
However, from the data alone we cannot see the impact of external factors that correlate with the patterns in our visualizations. For example, the influence of economic conditions, technological advancements, or significant cultural events is challenging to explore without additional contextual data or articles. Our dataset is also missing non-lyrical elements of music such as melody, harmony, and rhythm, which play a significant role in the appeal and innovation of music but cannot be captured through lyrical or genre analysis alone.
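As one concrete example of what the data can show, the word-count measure of song length mentioned above could be computed per decade roughly as follows. This is a sketch under stated assumptions: songs are given as (year, lyrics) pairs, words are split on whitespace, and songs with missing lyrics (represented here as None) are skipped so they do not count as zero-length songs.

```python
from collections import defaultdict


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1987 -> 1980."""
    return year // 10 * 10


def mean_words_per_decade(songs):
    """Average number of words per song for each decade.

    `songs` is an iterable of (year, lyrics) pairs; lyrics may be None.
    """
    totals = defaultdict(lambda: [0, 0])  # decade -> [word total, song count]
    for year, lyrics in songs:
        if lyrics is None:
            continue  # missing lyrics carry no length information
        decade = decade_of(year)
        totals[decade][0] += len(lyrics.split())
        totals[decade][1] += 1
    return {decade: words / n for decade, (words, n) in sorted(totals.items())}
```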
Gaps and cleaning issues in the data
Stop words and meaningless characters in the lyrics
As we explored sentiment and text analysis of the lyrics of each song, many stop words and meaningless characters showed up in our visualizations. We therefore developed a more detailed dictionary of stop words to exclude and used string operations in R to clean up the meaningless characters. The extended dictionary is based on the original “stopwords” dictionary from the “tidytext” package; on top of that, we added a few words that appear with especially high frequency across the entire lyrics data. In addition, to keep only lyrics readable in English, we kept only the 26 letters of the English alphabet. In this process, we also identified and removed numerals and punctuation marks that serve no significant purpose in lyrical analysis. This step ensures the textual data is not only clean but also uniformly structured for analysis.
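A rough sketch of this cleanup is shown below. The project did this with string operations in R; this Python version is only illustrative. The base stop-word set is a small stand-in rather than the tidytext list, dropping apostrophes before stripping other characters is a design choice made here, and picking the extra stop words automatically by frequency is an assumption (the source does not say how the high-frequency words were selected).

```python
import re
from collections import Counter

# Stand-in for a standard stop-word list (not the actual tidytext list).
BASE_STOP_WORDS = {"the", "and", "a", "i", "you"}


def clean_lyrics(text):
    """Lowercase the lyrics and keep only the letters a-z and single spaces."""
    lowered = text.lower()
    # Drop apostrophes so that "don't" becomes "dont" rather than "don t".
    lowered = re.sub(r"['’]", "", lowered)
    # Replace digits, punctuation, and non-English characters with spaces.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return re.sub(r"\s+", " ", letters_only).strip()


def extend_stop_words(cleaned_lyrics, base, top_n):
    """Add the top_n most frequent remaining words to the base stop words."""
    counts = Counter(
        word
        for lyrics in cleaned_lyrics
        for word in lyrics.split()
        if word not in base
    )
    extra = {word for word, _ in counts.most_common(top_n)}
    return set(base) | extra
```

In practice the candidate words returned by the frequency count would still need a manual review, since a frequent word is not necessarily a meaningless one.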
Overly detailed genre labels
The genre data that we collected from the Spotify API was also too specific and fragmented the analysis, making it difficult to see trends across broader musical styles. We therefore simplified the genre labels by grouping them into broader categories. For example, instead of keeping separate sub-genres such as indie rock or hard rock, we classified them all under Rock. This approach gave us a clearer view of the dataset.
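The grouping of sub-genres into broader categories could look something like the sketch below. The keyword rules, their order, the category names other than Rock, and the "Other" fallback are all hypothetical; the source only states that sub-genres such as indie rock and hard rock were classified under Rock.

```python
# Hypothetical keyword rules, checked in order; the first match wins.
BROAD_GENRE_RULES = [
    ("hip hop", "Hip-Hop"),
    ("rap", "Hip-Hop"),
    ("rock", "Rock"),
    ("country", "Country"),
    ("pop", "Pop"),
]


def broad_genre(spotify_genres):
    """Map a list of detailed genre labels to one broad category.

    The first label that contains any rule keyword decides the category;
    artists with no matching label fall into "Other".
    """
    for label in spotify_genres:
        lowered = label.lower()
        for keyword, broad in BROAD_GENRE_RULES:
            if keyword in lowered:  # simple substring match
                return broad
    return "Other"
```

With these rules, broad_genre(["indie rock"]) and broad_genre(["hard rock"]) both return "Rock". Because matching is by substring and rule order, borderline labels such as "pop rock" land in whichever category is listed first, so the rule order is itself an analytical decision.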
Ontologies in the data
Ranking data from Billboard is a key indicator of music popularity, primarily reflecting trends based on sales, radio play, and online streaming. Yet, as we work through this data, we recognize it only scratches the surface of the global musical landscape. It misses the nuanced cultural preferences and diverse musical tastes that vary by region, which are crucial for a fuller understanding of music’s role in society. Different regions have unique musical preferences influenced by their cultural, historical, and social context. For example, certain genres may dominate in specific areas due to local traditions or the influence of regional artists. Additionally, the way people engage with music also varies widely, adding layers of complexity that Billboard cannot capture.