NEH Digital Humanities Advancement Grant: 2021-2023
Andrew McGraw, University of Richmond
Joanna K. Love, University of Richmond
John Vallier, University of Washington Libraries
Andrew Weaver, University of Washington Libraries
Christopher Cotropia, University of Richmond
Louis Epstein, St. Olaf College
Matthew Oware, University of Richmond
Gregg Kimball, Library of Virginia
Doug Turnbull, Ithaca College
Robert K. Nelson, University of Richmond
Yucong Jiang, University of Richmond
Benjamin Leach, Independent Contractor
This white paper discusses findings from a fall 2021–summer 2023 National Endowment for the Humanities project, which convened eleven interdisciplinary humanities scholars, including musicologists, music archivists, I.P. and digital humanities experts, and computer scientists across the U.S. to explore and propose best practices for automatically collecting and archiving digital live music event data by geographic location. Our project was proposed in response to an emerging crisis in the scholarship of American music: Although newspaper and “zine” (local fan magazine) archives have always provided important primary sources for humanists analyzing the history and evolution of local music scenes, the digital revolution has forced many local news sources out of business, especially in smaller and mid-sized American cities. Concert publicity has thus largely shifted to web and social media platforms, particularly for lesser-known local artists. Today, many live music performances and almost all “virtual” performances are listed exclusively on websites and social media, and since the emergence of the COVID-19 pandemic, these platforms have become the lifelines for American music scenes. However, the ephemerality and unstructured nature of these online sources severely limit future possibilities for researching today’s music scenes if the listings are not captured and made publicly accessible.
Our Level I Digital Humanities Advancement Grant project thus aimed to build upon previous research begun by Dr. Andrew McGraw in 2013 about Richmond, Virginia’s local music scene (McGraw, 2021). In our efforts to expand the Richmond-based archive—which had already provided invaluable revelations about the city’s socio-economic disparities and cultural values from seemingly neutral datasets (such as liquor licensing, venue locations, and noise complaints)—we aimed to generate a larger archive that might open new research possibilities and revelations for future music scholarship, such as helping music scholars and other humanities researchers to better analyze data about music education, access, and infrastructure in particular regions, and finding important interdisciplinary humanities insights about twenty-first century, post-pandemic music-making.
Our project was originally concerned with music scenes in “the age of social media” and we intended to focus our data collection on social media platforms. However, a series of technical changes made this difficult or impossible on certain platforms. During our grant period, Facebook and Twitter (now Meta and X) severely curtailed data available through their APIs, while Instagram put up substantial barriers for scraping. In the end we focused on scraping web-based data and Instagram posts.
We worked with a developer to create a customizable, Chrome extension scraper for capturing live music event data both from websites and Instagram: Live Music Archiver. The scraper went through two extensive development and testing phases, and is currently freely available on GitHub at: https://github.com/broem/live-music-archiver-extension.
For web-based scraping, the user deploys the scraper while on a venue or calendar website, highlighting the area where each component of the first event listing is positioned (e.g., Artist Name, Date, Start Time, etc.; see Figure 1). The scraper uses this as a template to iteratively capture event data down the page. The user can edit fields if the capture run does not work as intended. The user then chooses from a dropdown list of options for a capture schedule (every day, week, or month), and manually enters geographic data (State and County FIPS, CBSA, and latitude/longitude) that enables easy integration with a wide range of preexisting socio-economic datasets (Esri, EPA, Gallup, etc.). Once saved, the scraper feeds its data to a Metabase instance that enables easy filtering, dashboard visualization, and analysis. The scraper saves the raw HTML payload and parsed fields and also grabs genre tag information for the artist (if available) from the Last.fm and Spotify APIs.
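To make the shape of a captured record more concrete, the following is a minimal sketch in Python of how one event and its geographic keys might be represented and then joined to a county-level dataset. It is not taken from the Live Music Archiver code: the class name, field names, the `join_by_county` helper, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EventRecord:
    """One captured event listing (hypothetical field names)."""
    artist: str
    date: str              # ISO 8601 date, e.g. "2023-04-01"
    start_time: str
    venue: str
    state_fips: str        # 2-digit state code
    county_fips: str       # 5-digit state + county code
    cbsa: str              # 5-digit Core Based Statistical Area code
    latitude: float
    longitude: float
    schedule: str = "week"                 # capture schedule: "day", "week", or "month"
    raw_html: str = ""                     # raw HTML payload of the listing
    genre_tags: list[str] = field(default_factory=list)


def join_by_county(events, county_table):
    """Attach county-level measures to each event via its county FIPS code.

    county_table maps a 5-digit county FIPS string to a dict of measures
    (an assumed structure standing in for an external dataset).
    """
    joined = []
    for ev in events:
        measures = county_table.get(ev.county_fips, {})
        joined.append({
            "artist": ev.artist,
            "date": ev.date,
            "county_fips": ev.county_fips,
            **measures,
        })
    return joined
```

Because the county FIPS code is stored as a string, leading zeros survive and the same key can be used to look up rows in any county-level table.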
For Instagram scraping, the user loads Instagram in an internet browser and logs in with their own account information. The user then enters the account names of the venues from which they want to capture post data. The scraper captures the raw HTML payload of every post for these venues, including image auto-captioning, emojis, etc.
For our 1.0 version, members of the grant team deployed the scraper to venue sites and Instagram accounts around the United States, in scenes they knew best. User feedback and the dataset from this preliminary run were then used to refine the scraper’s design and data management for the 2.0 version. The grant co-PIs (McGraw and Love) deployed 2.0 scrapers only in Virginia, generating a sample dataset from the date range: March 20, 2023–May 30, 2023. For the 2.0 test, we deployed 305 web-based scrapers and 87 Instagram profile scrapers. As we discuss further below, only a subset of these web pages and Instagram profiles returned useful data. Our final dataset included 1003 (clean) event listings and 2,273 Instagram posts.
Challenges and Improvements for Future Versions
We benefited greatly from advice about scraping music events given by our collaborator Doug Turnbull, who runs a related project at Ithaca College called Localify. We found that his Localify dataset provided an important comparison for our project. Following Turnbull’s advice, we integrated our scrapers with the genre tag data available through the Last.fm and Spotify APIs and then preprocessed and de-duped (removed duplicates) event listings before feeding them into Metabase. Some differences that arose between our projects include the fact that Localify is highly centralized in his computer science lab, relies heavily on scraping Google, and removes artists not in Spotify’s dataset. By comparison, our project is intended for capturing more granular, local listings through an interface that anyone can download and use. However, as we learned, the low-profile, highly local nature of the events we intended to capture meant that they were often listed on poorly designed websites that proved difficult to scrape. For instance, many “hyper-local” music listings were only posted as text embedded in a single image on a web page or on an Instagram post, or on embedded Google calendars, none of which are easily scrapable. Conversely, some highly dynamic, cutting-edge pages were so interactive that our scraper could not identify the structures in which event information was stored. We imagine that future versions of the scraper could potentially overcome these problems using more sophisticated techniques and text recognition (OCR).
Additional challenges we faced involved the inevitabilities of user error. For instance, all geospatial information (FIPS, CBSA, latitude and longitude) is entered manually in the scraper, and even expert users (the co-PIs) sometimes entered erroneous information which required manual correction in the dataset. This could be overcome in the future with drop-down menus listing county names, which would also automatically enter FIPS and CBSA data from a linked census database. Moreover, correct latitude and longitude for venues could be acquired through Google’s API.
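Until such drop-down menus exist, manually entered values could at least be checked before they are saved. The sketch below is one possible approach in Python and is not part of the current scraper: the function name, the messages, and the bounding box for Virginia (deliberately generous and approximate) are assumptions.

```python
import re

# Approximate, generous bounding box for Virginia (assumed values).
VA_LAT_RANGE = (36.5, 39.5)
VA_LON_RANGE = (-83.7, -75.2)


def validate_geo(state_fips, county_fips, cbsa, lat, lon):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if not re.fullmatch(r"\d{2}", state_fips):
        problems.append("state FIPS must be exactly 2 digits")
    if not re.fullmatch(r"\d{5}", county_fips):
        problems.append("county FIPS must be exactly 5 digits")
    elif not county_fips.startswith(state_fips):
        problems.append("county FIPS does not begin with the state FIPS")
    # CBSA may be blank for counties outside any statistical area.
    if cbsa and not re.fullmatch(r"\d{5}", cbsa):
        problems.append("CBSA must be exactly 5 digits")
    if not VA_LAT_RANGE[0] <= lat <= VA_LAT_RANGE[1]:
        problems.append("latitude falls outside Virginia")
    if not VA_LON_RANGE[0] <= lon <= VA_LON_RANGE[1]:
        problems.append("longitude falls outside Virginia")
    return problems
```

A check of this kind catches common slips such as swapped latitude and longitude or a dropped digit in a FIPS code, although it cannot confirm that a well-formed code belongs to the right county.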
De-duping is a significant challenge for any project. However, avoiding overcounting events is crucial when the total number returned in a region is intended to be compared with socio-economic datasets. Duplicate entries are inevitable when scraping multiple sites, especially when scraping individual venue websites in addition to aggregators, such as Google, Eventbrite, and Bandsintown. Deduping is also technically difficult, since the raw HTML (and sometimes even the basic data about an event) may vary slightly between the sites. For our project we developed some basic code to dedupe entries through preprocessing. This included parsing the wide range of date formats on different websites into a single shape, and deduping events that listed the same artist’s name on the same date. However, the dataset still required additional cleaning up by using SQL filters in Metabase, as well as a manual cleaning of the downloaded dataset. As AI technology rapidly improves, we anticipate that more sophisticated deduping techniques will become available.
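The two preprocessing steps described above (normalizing dates into one shape, then dropping events with the same artist name on the same date) can be sketched in Python as follows. This is an illustration rather than the project's actual code: the list of date formats, the function names, and the dictionary keys `artist` and `date` are assumptions.

```python
from datetime import datetime

# Date formats to try, in order (an assumed, non-exhaustive list).
DATE_FORMATS = [
    "%Y-%m-%d",        # 2023-03-24
    "%m/%d/%Y",        # 03/24/2023
    "%m/%d/%y",        # 3/24/23
    "%B %d, %Y",       # March 24, 2023
    "%b %d, %Y",       # Mar 24, 2023
    "%a, %b %d, %Y",   # Fri, Mar 24, 2023
]


def normalize_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    cleaned = " ".join(text.split())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_artist(name):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.casefold().split())


def dedupe(events):
    """Keep the first event per (artist, date); set aside unparseable dates."""
    seen = set()
    unique, unparsed = [], []
    for ev in events:
        iso = normalize_date(ev["date"])
        if iso is None:
            unparsed.append(ev)
            continue
        key = (normalize_artist(ev["artist"]), iso)
        if key in seen:
            continue
        seen.add(key)
        unique.append({**ev, "date": iso})
    return unique, unparsed
```

Keying on artist and date alone will merge two genuinely different shows by the same artist on one day, and will miss duplicates whose artist names are spelled differently, which is one reason further SQL filtering and manual cleaning remain necessary.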
As mentioned above, several social media platforms closed or restricted access through their APIs between the dates of our original proposal and our development phase. Luckily, we were still able to scrape data from Instagram, despite considerable technical hurdles. Instagram is technically a highly restrictive platform, because it uses IP blacklisting and blocks many cloud service providers (on which scraping code typically runs). We therefore had to direct our code through a different provider to mask our IP. Instagram also uses rate-limiting, which forced us to run the scraping code at a snail’s (or a human’s) pace. In the end, the data we collected from Instagram was very rich, but also very unstructured and messy. It is our sense that future archivists will have access to more powerful AI tools for identifying networks and correlations, and therefore will be able to better analyze and use the dataset than we can currently. Nevertheless, we were able to create some basic topic model analysis of our Instagram data, which we present below.
Many of the humanists on our team were interested in capturing genre information for each event. Genre is a highly deconstructed concept within musicology, yet one we seem unable to dismiss. It is a considerably subjective concept that carries deep and complex social meanings and historical baggage. Indeed, notions of genre are deeply intertwined with historically-defined marketing terms set during the early years of the burgeoning U.S. music industry—terms not always based on musical characteristics, but demographic and regional markers, as well as socio-political and economic considerations (cf. Hagstrom Miller, 2010 and Brackett, 2016). Although it is easy to deconstruct intellectually, genre remains a highly salient category for musicians, audiences, and venues. Unfortunately, straightforward genre descriptors were very rarely included in the event listing information we collected. Of course, most venues tend to book a limited range of genres or might even specialize in a few related forms—i.e., variations of rock, country, and bluegrass. We noticed too that genre was sometimes apparent from a band’s name or its visual representation: for example, sharp-edged fonts positioned against black backgrounds are favored among many metal bands. Today, many bands also describe themselves using a wide range of related genres, making categorization difficult. Despite these obstacles, we used various approaches to attempt to capture genre within our dataset, since this marker remains important in defining contemporary music scenes.
Machine Learning Image Detection
Early in our explorations we experimented with machine learning genre identification using a dataset of band images scraped from event listings. We used the Keras package in Jupyter Notebooks, images from our earlier scraping experiments, and selections from our 1.0 version scraper. These images were hand-categorized in a limited number of genre “bins” defined as “metal,” “indie,” “hip hop,” “classical,” “jazz,” “pop,” “rock,” “country.” Using a database of 1000 images for each category, we reached a nearly 70% accuracy rate using Keras. We then achieved nearly equal accuracy with Google’s “teachable machine,” which is easier to use (https://teachablemachine.withgoogle.com). A larger database would likely increase accuracy. We also found that the model worked better on images of musicians themselves, rather than on the stylings of event posters or album covers often used in event listings. Our full database of genre images is available here: https://drive.google.com/drive/folders/1CNCurh5E4yiiBvgYQD9Y9pALmUHeQl-H?usp=share_link.
Music industry recommendation systems (e.g., Pandora, Spotify, Last.fm) have historically used the term “tags” to categorize genres. Tags can also include more objective audio features (e.g., tempo and key), as well as culturally determined labels such as “hip-hop,” “rock,” etc. Following the Localify model, we pulled genre tags from both the Spotify and Last.fm APIs, but fewer than half of the artists we captured events for had entries in these systems. This is because both services rank the popularity of their tags, but our events featured artists with negligible rankings in these systems. Notably, Spotify’s tags are generated internally, whereas Last.fm’s are crowdsourced by their user base. Spotify also applies a smaller vocabulary of tags to a subset of more popular artists (Spotify popularity > 20) and its tag set likely comes from expert sources such as AllMusic, the Music Genome Project, and skilled human labelers. Last.fm tags, by contrast, can be generated by any user using “free text tokens.” These tend to be highly unstructured and used for a variety of purposes (e.g., personal organization, humor, and subversion). Last.fm tags are therefore very interesting from a sociological perspective, but less useful for accurately categorizing event listings by genre. A case study of white nationalist terms appearing in our sample dataset is illustrative of the challenges with using these API tags.
We noticed that in addition to standard genre tags, the Last.fm API was returning a number of white nationalist descriptors as well, including: “white boy,” “white country,” “white power,” “nazi,” “KKK country,” “MAGA,” “neo nazi,” “racist country,” “redneck,” and “Russian Hate.” These tags were associated with only twenty-one out of 1003 events in our sample dataset and were highly associated with more standard “country” and “Christian” tags. Examining the tags associated with the country singer Cole Swindell, who appeared in our sample dataset, revealed a complex picture. Besides obvious tags such as “country” and “youngstar,” his Last.fm tags also included “KKK country,” “Trump,” and “racist.” Some of Swindell’s tags are likely attempts by his critics on Last.fm to subvert his career or online presence by associating him with terms such as “queercore,” “farts,” and “my nigga.” Other tags might be legitimate though, indicating Swindell’s popularity in the white nationalist subculture. Swindell, along with many other artists with tags on Last.fm, is associated with over fifty distinct tags, many that are nonsensical or of negligible relevance to genre categorization. However, because user generated tags carry complex and nuanced sociological meanings, we find that a deeper analysis of such tags within a larger dataset would be worthy of future consideration and examination.
The Spotify and Last.fm APIs returned over 4,451 unique genre tags for our sample data set of 1003 live events, but we found most of them useless from an analytical perspective (e.g., it is not useful to categorize “awesome” or “seen live”). To more accurately analyze the returned genre categories, we limited our genre markers to Last.fm and Spotify tags with a popularity ranking of over 90%. We then “binned” the remaining tags into a smaller number of umbrella terms: “rock,” “country,” “hip hop,” “religious,” “classical,” “bluegrass,” “folk,” “indie-alternative-singer-songwriter,” “pop,” “jazz-funk,” “RnB-Blues-Soul,” “metal,” “reggae-world,” “electronic-dance.” Obviously, any categorization scheme will prove somewhat arbitrary (each of our collaborators might have proposed a different scheme), but we did our best to choose those that were more applicable to our dataset. To see the full list of the data set’s genre tags and their categorizations, see: https://docs.google.com/spreadsheets/d/1pA5X30A1k-XA3IIQ1ZJi8GODxo1cAV4n/edit?usp=share_link&ouid=113725266872517724151&rtpof=true&sd=true. Deleted genre tags can be found here: https://drive.google.com/file/d/1WBIaF6bUmXHmdr5alGrZiXzUmbmytej2/view?usp=share_link.
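The filtering and binning step can be sketched in Python as below. The mapping shown is a tiny, invented excerpt and does not reproduce the project's actual assignments (those are in the spreadsheet linked above). The function name, the 0 to 100 popularity scale, and the `(tag, popularity)` input shape are likewise assumptions.

```python
# Hypothetical excerpt of a tag-to-umbrella mapping; the real assignments
# are recorded in the project's spreadsheet, not here.
TAG_BINS = {
    "outlaw country": "country",
    "bluegrass": "bluegrass",
    "death metal": "metal",
    "trap": "hip hop",
    "bebop": "jazz-funk",
}


def bin_tags(tags, min_popularity=90):
    """Return the sorted umbrella genres for one artist.

    tags is an iterable of (tag_name, popularity) pairs, with popularity
    assumed to be on a 0-100 scale. Tags at or below the threshold, and
    tags with no umbrella term in TAG_BINS, are dropped.
    """
    bins = set()
    for name, popularity in tags:
        if popularity <= min_popularity:
            continue
        umbrella = TAG_BINS.get(name.casefold())
        if umbrella is not None:
            bins.add(umbrella)
    return sorted(bins)
```

Keeping the mapping in a plain lookup table makes the scheme easy to audit and to replace, which matters given how arbitrary any single binning scheme is likely to be.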
Although Virginia is finely divided into 134 counties, events in our sample dataset are highly concentrated in only six of them. This is unsurprising since event data correlates most strongly with the total populations and population densities in the state (and also across the U.S., according to the Localify dataset). Our genre data is even more concentrated, with only a few counties having the majority of the returned Spotify and Last.fm tags. This suggests that the view of music scenes from within the music industry itself (i.e., Spotify, Billboard/Nielsen ratings, ticket retailers) is geographically restricted as well. This means that Google and other big technology firms scraping music event data are also only able to “see” live music performances in a few select counties in Virginia. The map below visualizes our “binned” genre data from our sample dataset. In this map we calculated a percentage for each genre among all events for each county. We then calculated the average percentage for each, to avoid ignoring genres with low counts (such as “classical”). If the percentage of events for a given genre appears above average, it is visualized as “warm”/orange in Figure 2. If a genre appears below average, it is represented as “cool”/blue (Figure 2).
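The calculation behind the map can be sketched in Python as follows. The sketch assumes that the average for each genre is taken across all counties that returned at least one event, and that a genre absent from a county counts as 0% there; the text does not specify either point, and the function names are invented for illustration.

```python
from collections import Counter, defaultdict


def genre_shares(events):
    """events: iterable of (county, genre) pairs.

    Returns {county: {genre: percentage of that county's events}}.
    """
    counts = defaultdict(Counter)
    for county, genre in events:
        counts[county][genre] += 1
    shares = {}
    for county, counter in counts.items():
        total = sum(counter.values())
        shares[county] = {g: 100 * n / total for g, n in counter.items()}
    return shares


def classify(shares):
    """Label each (county, genre) pair relative to the cross-county average."""
    genres = {g for county_shares in shares.values() for g in county_shares}
    labels = {}
    for genre in genres:
        average = sum(s.get(genre, 0.0) for s in shares.values()) / len(shares)
        for county, county_shares in shares.items():
            pct = county_shares.get(genre, 0.0)
            if pct > average:
                labels[(county, genre)] = "warm"
            elif pct < average:
                labels[(county, genre)] = "cool"
            else:
                labels[(county, genre)] = "average"
    return labels
```

Comparing each county against a per-genre average, rather than against raw counts, is what lets a low-count genre such as classical still register as above or below average.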
Accordingly, the maps illustrate, for instance, that there is more hip hop than average in Virginia Beach (on the Eastern coast), while there is much less than average hip hop around Roanoke (in the Western mountains). We find the opposite is true for the indie-alternative-singer-songwriter and classical genres. There seem to be more scrapable classical and singer-songwriter performances in Roanoke as compared to Virginia Beach. These findings align with common cultural associations between genre distinctions and U.S. demographics: Hip hop artists are believed to be mostly Black, whereas indie music is believed to be performed by white artists. Of course, there is a lot of research that refutes or adds nuance to these generalizations in our twenty-first century globalized cultures. However, as illustrated in Figure 3, the southeastern coast does indeed have a much higher Black population (and also a diverse and transient military population) than the Western mountains, suggesting ways that population demographics might help, in part, to explain our findings.
Digging a little deeper into this data, we can see that Roanoke is the lone dark red spot pictured in the Central-west region of the state, as seen in Figure 3. Even though this county has a substantial Black population as compared to the surrounding mostly-white counties, Roanoke’s Black population is only roughly 30% of the total population of the county compared to the nearly 60% white majority. By comparison, the county of Virginia Beach, on the state’s Southeastern edge, is only roughly 20% Black, but its surrounding counties also have substantial Black populations. This suggests that a certain critical demographic mass can help to foster particular genres within larger metropolitan communities. In general, this area potentially provides an overall larger market for hip hop (and other diverse musical styles), and allows for more BIPOC venue owners, as compared to the smaller, less diverse population in the state’s Western mountains.
Venues and Demographics
In preparation for deploying our 2.0 scrapers, the co-PIs used Google to conduct an exhaustive census of live music venues across the state. This ranged from large professional venues, to churches, small town city squares, and restaurants that occasionally host singer-songwriters. If a space listed regular music on their website, we included it in our census. Of course, there are many informal and DIY spaces for public music making as well. We are aware of at least eight regular “house-show” and “underground” venues in Richmond alone, but did not include those spaces in our census, both due to privacy concerns and because their events are rarely publicly visible to Google. These listings are often advertised through private social media networks and only occasionally on public Instagram posts. Additionally, most of the live music spaces we identified are not primarily or exclusively music venues. Our venue census identified 346 live music spaces that are linked in an interactive map found here: https://arcg.is/1WfD9e
The presence of live music spaces in Virginia correlates with a range of socio-economic and demographic indicators, as outlined in the correlation matrix in Figure 4, which was generated using RStudio. The correlation between venue count and a range of demographic measures (per county) is illustrated in the top row. Internal correlations between demographic measures are indicated in lower rows. The red stars indicate the level of statistical significance; no red stars means that the measure is not statistically significant. In the humanities, any correlation above .2 is typically worthy of attention because these measures are from hyper-complex real-life data that shares many interacting variables and does not reflect data controlled in a lab. The larger the absolute value of the number (the correlation coefficient, r), the stronger the association between the measures. As always, it is important to remember that correlation is not necessarily indicative of causation. For more on how to read matrix correlations, see: https://r-coder.com/correlation-plot-r/.
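For readers less familiar with these statistics, the following Python sketch shows how a single cell of such a matrix could be computed by hand. It assumes the matrix entries are Pearson correlation coefficients (the negative values reported for some measures are consistent with this) and that significance is assessed with the usual t statistic on n − 2 degrees of freedom. The matrix itself was produced in R, so this is an illustration and not the authors' code, and the function names are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, at least 3 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom.

    Undefined when |r| == 1 (division by zero).
    """
    return r * math.sqrt((n - 2) / (1 - r * r))
```

Note that r and r² answer different questions: r gives the direction and strength of a linear association, while r² gives the share of variance in one measure accounted for by the other. A correlation of .53, for example, corresponds to an r² of roughly .28.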
To briefly summarize the measures above: The CDC’s Social Vulnerability Index is weakly correlated with venue count (.22), but there is a stronger correlation (.34) between the number of residents living below 150% of the poverty line and venue count. A similar correlation strength (.32) is found between venue count and the estimate of single parent homes. Unemployment is also weakly correlated with venue count (.24) and the estimate of homes with no internet is more strongly correlated (.35) with venue count. Surprisingly for the researchers, the total county population was weakly correlated with venue count (.21), but the diversity of the county was more strongly (.35) associated with venue count. Population density was by far the strongest indicator of the number of venues in a county (.53), while racial demographics varied in strength and direction of correlation. What general conclusions might we draw from this? In general, it seems that the number of music venues in Virginia counties is positively correlated with denser, more socio-economically diverse populations. Live music tends to be presented in communities that are more “mixed up,” demographically speaking.
Click through the following links to explore the relationship between demographics and our venue census:
- Venues and homes with no internet: https://arcg.is/1ePr5D0
- Venues and Social Vulnerability Index: https://arcg.is/HXrDf0
- Venues and residents below 150% of the poverty line: https://arcg.is/0O4HSe0
- Venues and Diversity Index: https://arcg.is/0Siu18
- Venues and combined demographics: https://arcg.is/PSzuC
Sample Live Event Dataset
Our sample dataset of live events in Virginia from 03/20/23–05/30/23 included 1003 distinct, clean entries. We captured more incomplete entries (a total of 1,297) that were missing crucial information, such as geographic locations or parse-able dates, or that were listings for non-musical events (e.g., book clubs, screenings, and comedy shows). We used a range of pre-processing scripts for parsing date formats and catching obvious duplicates, as well as a series of SQL filters on the processed dataset. Nevertheless, a certain amount of cleaning needed to be done manually. Our cleaned dataset included events in thirty-four of Virginia’s 134 counties. For comparison, the Localify dataset for 2019 included 1,181 live music events in Virginia for the same period, but because this was collected before the pandemic, it contains many music venues that have since shut down across the state. Additionally, the Localify dataset includes only artists entered in Spotify’s database, thereby missing many local and amateur artists.
Our sample database can be downloaded here: https://docs.google.com/spreadsheets/d/1vfRoz2v78P4yxrk5OXUFsNbfQVjREGC1/edit?usp=share_link&ouid=113725266872517724151&rtpof=true&sd=true.
The sample database including all captures and raw html payloads can be downloaded here:
In addition to creating the correlation matrix above that analyzes relationships between venues and demographics by county (Figure 4), we developed other matrices useful for analyzing the relationships between the event count from our sample dataset and a range of socio-economic, demographic, and well-being measures. These are presented in Figures 5 and 6.
Predictably, the strongest correlation represented here is between the event count and the number of venues in a county (.66). However, not all scraped events occurred in the venues we identified, and not all of the venues we identified hosted events during our capture dates. For example, many highly local and informal events (such as string-band jams) occurred in libraries or public squares not captured in our formal venue list. Moreover, many of the venues we identified, especially those on the Eastern shore, only host regular music during the peak summer tourist season, leaving gaps in our dataset. Additionally, the matrix reveals that the total population and population density were strongly correlated with event count (.47 and .48 respectively). Esri’s Diversity Index was also positively correlated with the number of music events (.38), as was the percent of Asian (.41) residents. The percent of white and Black residents, by far the two largest demographics in the state, did not correlate with the event count in a statistically significant way. We are unsure what might explain this demographic discrepancy, but suspect that it could be related to the fact that Asian and Hispanic populations in Virginia tend to cluster near metropolitan areas, whereas white and Black populations are more evenly distributed across the state. Other demographic measures illustrated in Figure 5 were only mildly correlated with event count, and these measures were not statistically significant. An exception to this is the represented percentage of the population living below 150% of the poverty line. Strangely, this measure (which was weakly statistically significant) was negatively associated with the event count (-.33), whereas it was positively associated with the number of venues in a county, as seen in Figure 4. This might be explained by the fact that only a few venues host the overwhelming majority of events represented in our dataset. These venues tend to be professional, large, and pricey, and are located in more affluent areas of the state, as opposed to the many restaurants and cafes that only occasionally have music and are scattered throughout the state.
Figure 6 illustrates a second range of political and well-being measures. When including Esri’s Market Potential and Political Leaning dataset, this matrix shows that music event count in a county is strongly associated with local contributions to NPR (.46). Conversely, it suggests that the event count was negatively correlated with estimates of residents in a county expressing a “very conservative” political outlook (-.43). By comparison, our event count data was more mildly correlated with the well-being measures developed by the EPA between 2000 and 2010. The number of live events was only moderately correlated with “social cohesion” (.31), a sense of “connection to nature” (.32), a sense of “cultural fulfillment” (.32), and self-reported “leisure time” (.32). According to the measures tracked by SimplyAnalytics (2023), the event count was also strongly negatively correlated with the percentage of county population suffering from hypertension (-.54). Unsurprisingly, most health measures in Virginia are better in the more affluent, metropolitan regions, which also happen to host most of the state’s popular venues and live music events.
We are continuing to collect data through our scrapers as of this writing, and will be interested to see how these correlations evolve as the dataset grows over time. In fact, many of the well-being correlations illustrated in Figure 6 disappear when compared to Localify’s Virginia dataset for the earlier, pre-COVID period, 2017–2019, which (as mentioned above) includes a much higher number of live events (17,533) across all but one of the state’s counties. Correlations for this dataset are shown in Figure 7. This matrix might not present an entirely appropriate comparison to those listed above because the Localify dataset includes only popular and more established Spotify artists. We expect that a proper comparison dataset including many more local and amateur “community” musicians (the focus of our scrapers) might produce different results, especially in regard to community well-being measures.
Our sample event count dataset is also available through a series of online, interactive ArcGIS maps, similar to that illustrated in Figure 8 below. Examples include:
- Event Count and Population Density: https://arcg.is/1WOujT0
- Event Count and Diversity Index: https://arcg.is/0ze9DH
- Event Count and Contributions to NPR: https://arcg.is/azPn40
- Event Count and Cultural Fulfillment: https://arcg.is/qqiPH0
In thinking through this data, we want to make it clear that our scrapers, and our subsequent analyses, are not a measure of the total amount of music, or even the total amount of public live music, that occurred in these counties during the project period. We can only consider this as the total amount of live music that Google’s search engine knows about, since this is how our web scrapers found these events. This important consideration cannot be overstressed, since unfortunately, it is only what Google (and other large tech platforms) could see that will become the future cultural archive of this period. We address the lacunae in more detail below and in our discussion of “music deserts” in our forthcoming roundtable (McGraw et al., 2023).
In addition to our census of physical venues, we created a list of public Instagram profiles associated with live music events in Virginia. Many, but far from all, venues we identified hosted websites in addition to Instagram and Facebook profiles. Additionally, some smaller venues only had an Instagram profile, but no website. Predictably, the number of Instagram profiles for venues closely tracked with the number of physical venues in each county. As with our venue count, the number of Instagram profiles per county appears to track overall with household income. Future analyses might consider the possible relationship between age cohorts in particular counties and Instagram profile counts. In regard to other social media platforms, Facebook (now Meta) is generally considered to be a declining social media platform used mostly by older populations. TikTok is an expanding platform geared towards a younger cohort. At the moment, Instagram lies somewhere in the middle, demographically speaking. It might therefore be useful to analyze correlations between the number of Instagram profiles and the size of the 25–55 age cohort in each county as we collect more data.
Topic Model Analysis
We captured the entire HTML payload for every post in the profiles we followed. This is an extremely “dirty” dataset partly because many venues in Virginia are also, or are primarily, restaurants—a result of the state’s byzantine alcohol laws (McGraw 2021). Many Instagram profiles include as many or more posts about food as about music, although some prioritize musical events in ways their websites cannot—which is, again, due to the food and alcohol laws (see more in McGraw et al., 2023). The dataset is very rich as it includes responses, emojis, and AI auto-captioning for images. Accordingly, our Instagram dataset for the period 03/20/23–05/30/23 includes 2,273 posts. The raw Instagram dataset can be downloaded using the following link: https://drive.google.com/file/d/1YMA2ftFhUe-V9SHVWj8a87gNxL2Sm-g0/view?usp=share_link
We anticipate that such a dataset will be best suited for future AI analysis, considering its complexity. To get a sense of the data’s capabilities, we ran a basic LDA (Latent Dirichlet Allocation) topic model analysis, using 1,000 iterations on two versions of the dataset, which returned some interesting trends. Both models can be downloaded using the links below.
- Topic model analyzing post text and emojis: https://dsl.richmond.edu/rnelson2/musicdata/jslda.html?docs=posttext.txt&stoplist=stoplist.txt&topics=25
- Topic model analyzing full HTML payload: https://dsl.richmond.edu/rnelson2/musicdata/jslda.html?docs=all.txt&stoplist=stoplist.txt&topics=25
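For readers unfamiliar with LDA, the sketch below fits a topic model by collapsed Gibbs sampling in plain Python. It is not the tool used to produce the models linked above. The hyperparameters `alpha` and `beta`, the function names, the toy posts, and the small stoplist are all hypothetical choices made for illustration. The linked models use 25 topics and 1,000 iterations, while the toy run uses far fewer of both.

```python
import random

def tokenize(text, stoplist):
    """Lowercase, split on whitespace, and drop stoplisted tokens."""
    return [t for t in text.lower().split() if t not in stoplist]

def lda_gibbs(docs, num_topics, iterations, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab),
    where the first two are raw token-count matrices.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab

def top_words(topic_word, vocab, n=3):
    """Return the n highest-count words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]

# Toy posts invented for illustration (NOT drawn from the project dataset).
stoplist = {"the", "at", "and", "on", "a"}
posts = [
    "live music tonight at the tap room",
    "new stout and ale on tap",
    "bluegrass band live tonight",
    "craft beer and ale at the brewery",
]
docs = [tokenize(p, stoplist) for p in posts]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2, iterations=200)
for k, words in enumerate(top_words(topic_word, vocab)):
    print(k, " ".join(words))
```

The leading words printed for each topic play the same role as the short labels, such as “ale beer at,” that identify topic documents in the figures.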
The most obvious observation from the chart in Figure 10, which displays only the beginning text of each topic document, is the significant role that food and beer play in the social media ecology of music scenes in Virginia. Reading the full topic documents provided in Figure 11 allows for a more nuanced understanding of the relationships pictured in Figure 10. To analyze one of the stronger associations in the chart, consider the relationship between “ale beer at” (near the bottom of the y axis on the right) and “drinklocal beer stout” (near the right side of the x axis). These are topic documents 16 and 21, respectively, as provided in Figure 11. Topic document 16 includes many references to classical music. Topic document 21 includes several references to craft beer. Our local experience in Richmond suggests a strong association among upper-middle-class white communities, craft beer (and craft microbreweries), and classical music. This occurs, in part, because the Richmond Symphony regularly performs at the local Hardywood brewery.
To analyze a weak association, consider “open tap room” (near the top of the y axis) and “at country Floyd” (near the right side of the x axis). These are topic documents 4 and 22, respectively, as listed in Figure 11. Topic document 4 is almost exclusively concerned with craft beer. Topic document 22 is associated with the Floyd Country Store, an important venue for old-time, bluegrass, and folk music in rural Floyd County in southwestern Virginia. The Floyd Country Store is very family-oriented, and although it offers a full menu, it does not advertise alcohol. The store, renovated in 2007 as a restaurant and cultural center, serves as a “wholesome” patron of local traditional musics, which (perhaps ironically) have long been associated with the illicit moonshining culture of the region’s poorer white communities.
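The strong and weak associations described above can be made concrete with a score for each pair of topics. This document does not state how the chart computes association, so the sketch below assumes one common choice: pointwise mutual information (PMI) over the posts in which each topic is prominently present. The function name, the `min_tokens` threshold, and the counts are hypothetical.

```python
import math

def topic_pmi(doc_topic_counts, min_tokens=2):
    """PMI for every topic pair, based on per-document topic presence.

    A topic counts as present in a document when at least min_tokens of
    the document's tokens are assigned to it. A pair that never co-occurs
    gets None, because its PMI is undefined (log of zero).
    """
    num_docs = len(doc_topic_counts)
    num_topics = len(doc_topic_counts[0])
    present = [[c >= min_tokens for c in row] for row in doc_topic_counts]
    doc_freq = [sum(row[k] for row in present) for k in range(num_topics)]
    pmi = {}
    for a in range(num_topics):
        for b in range(a + 1, num_topics):
            both = sum(1 for row in present if row[a] and row[b])
            if both == 0:
                pmi[(a, b)] = None
                continue
            # log( P(a, b) / (P(a) * P(b)) ), with probabilities as doc shares
            pmi[(a, b)] = math.log(both * num_docs / (doc_freq[a] * doc_freq[b]))
    return pmi

# Placeholder token counts per (post, topic); NOT project data.
counts = [
    [5, 4, 0],
    [6, 3, 0],
    [0, 0, 7],
    [4, 0, 3],
]
pmi = topic_pmi(counts)
for pair, score in pmi.items():
    print(pair, score)
```

Under this measure, a positive score means two topics share posts more often than chance would predict, a negative score means less often, and `None` marks pairs that never share a post.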
Figure 12 presents a topic analysis of the full HTML payload for our Instagram dataset, including image auto-captioning and other information not visible in the post itself. To analyze one of the stronger associations in the chart, consider the relationship between “music Bristol” (near the top of the y axis on the right) and “and at music tickets” (near the right side of the x axis). These are topic documents 3 and 23, respectively, as represented in Figure 13. Topic document 3 is concerned primarily with the town of Bristol, often considered the “birthplace of country music.” This town straddles the Virginia–Tennessee border in the rural southwestern part of the state. Topic document 23 mentions “bluegrass,” “old-time,” “country,” “folk,” and “carter.” The latter term refers to the famous Carter family, often credited with popularizing early country music, whose ancestral farm (the “Carter Family Fold”) is near Bristol. Topic document 23 therefore describes the primary musical genres programmed in the geography described in topic document 3.
To analyze a weak association in this chart, consider “music at (camel emoji)” (near the top of the y axis) and “blue mountain we” (near the middle of the x axis). These are listed in Figure 13 as topic documents 6 and 16. Topic document 6 is focused on The Camel, a popular local venue in the city of Richmond that programs an even mix of local and small touring acts, who primarily play rock, punk, and indie. It caters to a younger, urban demographic and is near the heart of the city. By contrast, topic document 16 includes the terms “blue,” “ridge,” “brewery,” “winery,” “afton,” “valley,” “sunset,” and “view.” The Blue Ridge Parkway and Skyline Drive run along the Blue Ridge Mountains and meet near Afton, Virginia, just past Charlottesville in the central-western region of the state. This is a picturesque area, home to many upscale wineries, craft breweries, and bed and breakfasts catering to an older, whiter, and generally wealthier crowd.
Considerations for Future Work
As we have discussed here, scraping hyper-local website and Instagram events presents significant technical challenges because the data is difficult to capture and very noisy. However, we anticipate that the rapid development of AI will make it easier to write code that can better manage this noise. Future AI tools, such as multidimensional analysis and word embeddings, will likely also enable us to identify complex networks in the dataset. And as with all datasets, ours will become more useful and insightful as it expands.
We are therefore continuing to collect and archive data from all of our 2.0 scrapers. Beyond supplying basic information about venues and other live event locations (parks, restaurants, schools, etc.), a larger dataset covering a longer period could be analyzed to reveal the temporal patterns that unfold over a typical year. Because we also collect information about ticket prices, we might better analyze correlations between event pricing and socio-economic demographics. A more substantial dataset containing more information about genre tags could also be used for “sentiment analysis,” showing in more detail how genre is associated with particular demographics and with the visual cultures (using event images) and language styles (using Instagram responses and emojis) of surrounding populations.
There are, of course, many other ways this data can be further analyzed and used. For this reason, the team has co-authored a roundtable that explores this information and our process in more depth and suggests further possibilities for future research (McGraw et al., 2023). We are motivated to continue work on this project and to seek additional grant funding and expert researchers to expand it into a tool that music scholars may use to get a more inclusive and accurate picture of U.S. music scenes in the twenty-first century.
References
Brackett, David. 2016. Categorizing Sound: Genre and Twentieth-Century Popular Music. Berkeley: University of California Press.
CDC/ATSDR Social Vulnerability Index. 2020. Database Virginia. Available at: https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html (Accessed: 10 August 2023).
ESRI (2023) ‘ESRI Market Potential’. Available at: https://doc.arcgis.com/en/esri-demographics/latest/regional-data/market-potential.htm (Accessed: 12 August 2023).
Hagstrom Miller, Karl. 2010. Segregating Sound: Inventing Folk and Pop Music in the Age of Jim Crow. Durham, NC: Duke University Press.
McGraw, Andrew. 2021. Audible RVA. https://audible-rva.org/infrastructure/ (Accessed: 26 June 2023).
McGraw, Andrew. 2022. “Sonic and Affective Geographies in Richmond Virginia.” In Contested Frequencies, 19–41. Edited by Joanna K. Love and Jessie Fillerup. New York: Bloomsbury Press.
McGraw, Andrew et al. 2023. “Digital Humanities Approaches to Archiving U.S. Music Scenes.” Forthcoming roundtable publication.
SimplyAnalytics (2023) ‘Census 2021. Current Estimates Data’. Available at: https://app.simplyanalytics.com/index.html (Accessed: 14 July 2023).
United States Census Bureau (2023a) ‘Quick Facts, Richmond City, Virginia’. Available at: https://www.census.gov/quickfacts/richmondcityvirginia (Accessed: 14 July 2023).
United States Census Bureau (2023b) ‘Quick Facts, Virginia Beach, Virginia’. Available at: https://www.census.gov/quickfacts/fact/table/virginiabeachcityvirginia/PST045222 (Accessed: 14 July 2023).
United States Environmental Protection Agency (2023) ‘Human Well-Being Index (HWBI) for U.S. Counties, 2000-2010’. Available at: https://catalog.data.gov/dataset/human-well-being-index-hwbi-for-u-s-counties-2000-2010 (Accessed: 20 August 2023).