Inside data journalism academia

The network properties and actors shaping research on computational and data journalism

Posted by Simona Bisiani on April 18th, 2022

Data journalism as an academic discipline is relatively new, emerging as the term came into wider use to describe journalism driven by data. In a recent article, DataJournalism.com spoke to two academics - Bahareh Heravi (University College Dublin) and Florian Stalph (Ludwig-Maximilians-Universität München) - about the size of the community involved in it. Dr Heravi noted that the opportunity to engage in data journalism research was until recently limited by the scarcity of graduate programmes with a focus on the subject. As many Master's programmes specialising in data journalism have emerged in the last five years, one can expect the number of PhD students researching data journalism to grow in the near future. This understanding of data journalism as an emerging academic specialisation resonates with Dr Stalph's personal experience of asking his supervisor for the opportunity to write his Master's thesis on data journalism, which then led him to devote his PhD studies to it.

The fact that data journalism studies is an emerging discipline might suggest that the community involved in it is pretty small - but is this really the case? At the same time, its importance within the wider framework of journalism research is unmeasured - is there a way to capture the resonance of data journalism within academic journalism literature? With these questions in mind, I decided it was time to get my hands on some data about data journalism literature. I set out to understand different aspects of it, tackling even more questions, such as: which are the most influential papers, and who are the authors behind them? Are there a lot of collaborations? For how long have researchers been writing about data journalism?

In this article, I explore those questions using Google Scholar data on data journalism studies. I chose Google Scholar because it is a popular point of access to academic literature, regardless of institutional credentials, and because metadata about the results of search queries is easy to collect. This way I was able to gather data that provides a good overview of the literature on data journalism. I ran a search query aimed at capturing the 980 most influential pieces of academic literature on data journalism (see the Methodology section below for more information). I also ran a generic journalism query, which I used to examine the relationship between the parent field and the sub-discipline. A little reminder: the analysis that follows should be understood as an exploration of the most influential papers about the field, rather than of the academic field in its entirety.

Time

One thing I wanted to capture was the time factor: knowing data journalism is a rather new discipline, for how long have papers about it been produced? And where in that timeline do the most influential papers sit? One expectation is that, within the "lifetime" of a field of research, some papers have become iconic. This can be an objective reflection of the quality and relevance of a paper for the field, or perhaps the unintended consequence of a "rich get richer" effect: popular papers become even more popular because of the importance they signal through elements such as citations and rank. Another possible explanation is that the earliest papers might also be the most important ones for the field, laying the foundations for data journalism studies.

So I plotted it: the data tells us that most of the top influential literature on data journalism is also the most recent. This is an interesting fact, as it suggests that the newest research has a better chance of making it to the top, despite a potentially larger number of papers being produced. It could also mean that many novel papers are groundbreaking - which, given that data journalism evolves hand in hand with digitization, might not be so strange. But this could also hint that a core body of literature about data journalism is still forming. Yet another possible explanation is that many papers published in the last few years come out of better funded research, itself a result of the increasing importance attributed to the field of data journalism.

The oldest entry in our dataset is the 1972 chapter called Precision Journalism by Neil Felgenhauer, appearing in The Magic Writing Machine. The term "precision journalism" indeed dates back to 1971, and describes a form of journalism that adopts the quantitative toolkit of the social sciences for the purpose of gathering and analysing data, pioneered by Philip Meyer and some colleagues (see the DataJournalism.com Handbook chapter by C.W. Anderson for an in-depth exploration of the history of data journalism). From then until 2010, most literature about data journalism that is still very influential today covered either precision journalism (e.g. Philip Meyer's 1989 Precision Journalism and the 1988 US elections), the rise in computer usage in journalism (e.g. Bruce Garrison's 1996 Tools daily newspapers use in computer-assisted reporting), or how to handle polling data (e.g. Arnold H. Ismach's 1984 Polling as a news-Gathering tool). The first occurrence of the term "data journalism" in our dataset comes in 2010 with Elena Egawhary and Cynthia O’Murchu's The Data Journalism Book, published for the Centre for Investigative Journalism. Now archived, the book provided guidance on handling data in Excel, and specifically mentioned the term "data journalism".

Unlike data journalism, research about the general academic field of journalism goes further back in history. The data shows a clear pattern: the most influential papers about journalism are not necessarily the most recent nor the oldest; rather, they are somewhat left-skewed with a peak around 2010.

One can speculate at length about why the peaks differ in the time distribution of the most influential papers. But one fact is worth noticing: while the last few years have produced fewer influential papers about journalism, an increasing proportion of those that did make the list were also pertinent to data journalism. What this suggests is that there are data journalism papers that are important not just for the academic field of data journalism, but for the whole journalism research field (for transparency, this could be a reverse-causality type of problem: it could also be that very important papers about journalism in general appeared in the data journalism search query too, although this is less probable given how specific that search query is).

Citations and Ranking

While time is great at hinting at the life span of an academic field, there are metrics that allow us to dig deeper and compare the importance of one paper relative to another. Google Scholar, in particular, offers two: ranking and number of citations.

In the plot below, I show each paper in our dataset, plotted by year of publication and number of citations. I added a line showing the overall trend, which is negative for all categories: over time, papers appearing in the dataset have on average fewer citations than earlier papers. This goes back to the point raised previously about a potential "rich get richer" effect, but also to the fact that many of these papers have existed for a longer period and thus have been available for newer researchers to use as a backbone for their research.

One takeaway from the plot below is that papers about data journalism have on average fewer citations than those about journalism, which, given that data journalism is newer and more niche, makes sense!
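For readers who want to reproduce a trend line like the one described above, here is a minimal sketch. It assumes a straight line fitted by ordinary least squares, which may not be the smoothing used for the plot in this article, and the (year, citations) pairs are made up for illustration.

```python
from statistics import mean


def trend_line(years, citations):
    """Return (slope, intercept) of the ordinary least-squares line."""
    x_bar, y_bar = mean(years), mean(citations)
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, citations))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up (year of publication, citation count) pairs, not real data.
papers = [(2012, 400), (2014, 300), (2016, 200), (2018, 100)]
years, citations = zip(*papers)

slope, intercept = trend_line(years, citations)
print(f"citations change by {slope:.1f} per year of publication")
```

A negative slope corresponds to the pattern in the text: the more recent the paper, the fewer citations it has accumulated on average.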

Citations are a strong signal of legitimacy and relevance, but papers with the highest number of citations are not necessarily the top ranking ones. In the plot below I explore the correlation between rank and citations. I find that the correlation is much more noticeable for the journalism papers than for the data journalism ones, but overall quite weak. This plot leaves us with an open question that cannot be answered with the data at hand: what drives ranking position in Google Scholar? While the algorithm operates as a bit of a black box, Google Scholar's "about" page provides some guidance, stating that it aims to rank documents the way researchers do, weighing the full text of each document, where it was published, who it was written by, as well as how often and how recently it has been cited in other scholarly literature.
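One way to put a number on the relationship between rank and citations is a rank correlation. The sketch below uses Spearman's coefficient, which is my choice for illustration rather than necessarily the measure behind the plot, and it runs on made-up values. Position 1 is the top search result, so a negative coefficient means that better-placed papers tend to have more citations.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up data: position in the search results and citation count.
positions = [1, 2, 3, 4, 5]
citations = [900, 150, 400, 30, 60]

rho = spearman(positions, citations)
print(f"Spearman correlation between position and citations: {rho:.2f}")
```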

The ranking position of the category "both", which describes papers appearing in both the journalism and data journalism datasets, is the one extracted from the latter. What I notice then is that many of the papers that also appear in the journalism dataset are top ranking papers within the data journalism one. One example is the number one ranking paper for data journalism: Computational Journalism, written by Sarah Cohen, James T. Hamilton and Fred Turner.
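The three categories used in these plots can be derived by comparing the two sets of search results. The sketch below assumes papers are matched on a normalised title, which is an assumption on my part (any shared identifier would do), and apart from the paper named above the titles are hypothetical.

```python
def normalise(title):
    """Lower-case and collapse whitespace so trivial differences still match."""
    return " ".join(title.lower().split())


def categorise(data_journalism_titles, journalism_titles):
    """Label each normalised title by the dataset(s) it appears in."""
    dj = {normalise(t) for t in data_journalism_titles}
    j = {normalise(t) for t in journalism_titles}
    labels = {t: "both" for t in dj & j}
    labels.update({t: "data journalism" for t in dj - j})
    labels.update({t: "journalism" for t in j - dj})
    return labels


labels = categorise(
    ["Computational Journalism", "A hypothetical data paper"],
    ["computational  journalism", "A hypothetical journalism paper"],
)
for title, category in sorted(labels.items()):
    print(f"{category:16} {title}")
```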

Authors

So who are the people behind the most influential papers about data journalism? Our dataset brought up 1134 unique authors for the 980 papers. The mean number of papers an author has written is 1.39 and the median is 1. The most prolific authors are Ester Appelgren from Södertörn Högskola in Stockholm and Seth C. Lewis from the University of Oregon, each having written 14 papers appearing in the dataset.

Beyond our data, Dr Appelgren's work has been cited 673 times in total (as of December 2021), and her most popular paper is a 2014 publication where, alongside her colleague Gunnar Nygren, she explored data journalism practices in Sweden. She tells us about her involvement in the field: "I became interested in data journalism mainly because it combines journalism and media technology, and through this convergence both possibilities and challenges arise that are interesting to study and also important to study from a scientific point of view. [...] Ten years ago, together with Journalism professor Gunnar Nygren I started to work closely with data journalists at SVT, in particular Helena Bengtsson and Kristofer Sjöholm to develop a project that aimed at facilitating the development of data journalism in Swedish newsrooms." That project eventually led to the Noda conference, a cooperation between journalists and academia in the Nordic countries.

Seth C. Lewis, on the other hand, has been cited an incredible 9064 times in total, signalling his importance beyond data journalism studies. In fact, his most cited paper is a 2012 journal article that explores journalists' usage of Twitter. He has also written a chapter in the latest edition of the Data Journalism Handbook, about the datafication of journalism and industry-academy collaborations.

In comparison, the journalism dataset has 952 unique authors (against 1134 for data journalism), and the mean number of papers an author has written is 1.50, indicating that the community of people producing the most influential journalism literature is slightly smaller. In fact, the most prolific author here is Mark Deuze of the University of Amsterdam, with twenty of his publications included in the dataset. Interestingly, he is followed by the same Seth C. Lewis mentioned above, who appears here a total of 18 times.
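The author figures in this section boil down to counting how many papers each name is attached to. Here is a minimal sketch with hypothetical names; it assumes each paper comes with a clean list of author names, which real search-result metadata often does not guarantee (name variants and truncated author lists would need handling first).

```python
from collections import Counter
from statistics import mean, median

# Hypothetical records: one list of author names per paper.
papers = [
    ["A. Author", "B. Author"],
    ["A. Author"],
    ["C. Author"],
    ["A. Author", "C. Author", "D. Author"],
]

# Count each author once per paper, even if a name were listed twice.
papers_per_author = Counter(name for authors in papers for name in set(authors))

counts = list(papers_per_author.values())
unique_authors = len(papers_per_author)
mean_papers = mean(counts)
median_papers = median(counts)
most_prolific = papers_per_author.most_common(1)[0]

print(unique_authors, mean_papers, median_papers, most_prolific)
```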

Collaborations

Collaborations are a big part of the picture in academia, as they facilitate the creation of knowledge through shared resources and allow expertise to flow across departments and institutions. Out of the 1134 data journalism authors reported above, 551 published alone, and the median number of authors per paper is 1. The mean, by contrast, is 1.53. The most recurrent collaboration is between the researchers Alfred Hermida and Mary Lynn Young, who have coauthored four publications, including a book about the history of data journalism through a scholarly lens. In the plot below, I show the collaborations in our dataset.

Are academics in data journalism more collaborative than those in journalism studies? From our datasets, this appears to be the case. There were 429 collaborations in data journalism, and 336 in journalism. Moreover, the number of authors publishing alone in journalism is 644 (against 551 in data journalism), and the mean number of authors per paper is 1.38 (against 1.53), indicating a slightly more collaborative production for data journalism.
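To make the collaboration measures concrete, the sketch below counts co-author pairs, solo-authored papers and the mean number of authors per paper. It treats a "collaboration" as a distinct pair of authors sharing at least one paper, which is an assumption for illustration and may differ from the definition behind the figures above; the names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hypothetical records: one list of author names per paper.
papers = [
    ["A. Author", "B. Author"],
    ["A. Author", "B. Author", "C. Author"],
    ["C. Author"],
    ["D. Author"],
]

# Every unordered pair of co-authors on a paper counts as one joint paper.
pair_counts = Counter()
for authors in papers:
    for pair in combinations(sorted(set(authors)), 2):
        pair_counts[pair] += 1

distinct_pairs = len(pair_counts)
most_recurrent = pair_counts.most_common(1)[0]
solo_papers = sum(1 for authors in papers if len(authors) == 1)
mean_authors = mean([len(authors) for authors in papers])

print(distinct_pairs, most_recurrent, solo_papers, mean_authors)
```

The pair counts double as edge weights if the collaborations are drawn as a network, with authors as nodes.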

Text Analysis

Last, I performed a simple text analysis of the frequency of the terms appearing in the available abstracts (the data collection method allowed me to retrieve snippets of those, and they were available for most papers), removed some stopwords, and voilà: below you can see the resulting word cloud of the most prominent terms. No surprises there, right?
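The term counts behind a word cloud like this one can be produced in a few lines. The sketch below is only an illustration: the stopword list is a tiny hand-picked set rather than the one used for this article, the tokeniser keeps plain a-z words only, and the snippets are invented.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on", "this"}


def term_frequencies(texts, stopwords=STOPWORDS):
    """Count lower-cased alphabetic tokens across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Invented abstract snippets, not taken from the dataset.
snippets = [
    "This study examines data journalism in newsrooms.",
    "Data and the practice of journalism: a survey of journalists.",
]

freq = term_frequencies(snippets)
print(freq.most_common(5))
```

The resulting counts are what a word cloud library would take as input to size each term.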

Wrapping up

In this blogpost I explored data about the top appearing literature on data journalism from Google Scholar. The analysis shows that the most influential literature about data journalism has nearly all been created within the last decade, unlike for the field of journalism studies as a whole. I noticed that data journalism is growing in importance as a discipline within journalism studies as well. Unlike in journalism studies, data journalism academic literature is characterised by a weaker correlation between citations and ranking, and data journalism papers generally have fewer citations than generic journalism ones. Finally, the most influential literature about data journalism is the result of a level of collaboration that surpasses that of generic journalism studies. It appears that for the niche subject of data journalism there is still room for many more authors to produce influential work, whereas in journalism a few authors have produced an astonishing number of top appearing papers.