--- title: "RNewsflow: Tools for analyzing content homogeneity and news diffusion using computational text analysis" author: "by Kasper Welbers and Wouter van Atteveldt" bibliography: references.bib date: "" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to RNewsflow} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r, include=FALSE} options(digits=3) library(knitr) ``` ## Abstract > Given the sheer amount of news sources in the digital age (e.g., newspapers, blogs, social media) it has become difficult to determine where news is first introduced and how it diffuses across sources. > RNewsflow provides tools for analyzing content homogeneity and diffusion patterns using computational text analysis. > The content of news messages is compared using techniques from the field of information retrieval, similar to plagiarism detection. > By using a sliding window approach to only compare messages within a given time distance, many sources can be compared over long periods of time. > Furthermore, the package introduces an approach for analyzing the news similarity data as a network, and includes various functions to analyze and visualize this network. # Introduction The news diffusion process in the digital age involves many interdependent sources, ranging from news agencies and traditional newspapers to blogs and people on social media [@meraz11; @paterson05; @pew10]. We offer the `RNewsflow` R package as a toolkit to analyze the homogeneity and diffusion of news content using computational text analysis. This analysis consists of two steps. First, techniques from the field of information retrieval are used to measure the similarity of news messages (e.g., addressing the same event, containing identical phrases). Second, the temporal order in which messages are published is used to analyze consistent patterns in who follows whom. The main contribution of this package lies in the specialized application of document similarity measures for the purpose of comparing news messages. News is a special type of information in the sense that it has a time dimension---it quickly loses its relevance. Therefore, we are often only interested in the similarity of news documents that occured within a short time distance. By restricting document comparisons to a given time distance, it becomes possible to compare many documents over a long period of time, but available document comparison software generally does not have this feature. We therefore offer the `newsflow_compare` function which compares documents using a sliding window over time. In addition, this package offers tools to aggregate, analyze and visualize the document similarity data. By covering all steps of the text and data analysis in a single R package, it becomes easier to develop and share methods, and analyses become more transparent and replicable. The primary intended audience of this package is scholars and professionals in fields where the impact of news on society is a prime factor, such as journalism, political communication and public relations [@baum08;@boczkowski07;@ragas14]. To what extent the content of certain sources is homogeneous or diverse has implications for central theories of media effects, such as agenda-setting and the spiral of silence [@bennett08;@blumler99]. Identifying patterns in how news travels from the initial source to the eventual audience is important for understanding who the most influential "gatekeepers" are [@shoemaker09]. Furthermore, the document similarity data enables one to study news values [@galtung65] by analyzing what elements of news predict their diffusion rate and patterns. The package is designed to appeal to scholars and professionals without prior experience in computational text analysis. This vignette covers the entire chain from processing raw data---written text with source and date information---to analyzing and visualizing the output. It points to relevant software within and outside of R for pre-processing written texts, and demonstrates how to use the core functions of this package. For more advanced users there are additional functions and parameters to support versatility. `RNewsflow` is completely open-source to promote active involvement in developing and evaluating the methodology. The source code is available on Github---https://github.com/kasperwelbers/RNewsflow. All data used in this vignette is included in the package for easy replication. The structure of this vignette is as follows. The first part discusses the data preparation. There are several choices to be made here that determine on what grounds the content of documents is compared. The second part shows how the core function of this packages, `newsflow_compare`, is used to calculate document similarities for many documents over time. The third part demonstrates functions for exploring, aggregating and visualizing the document similarity data. Finally, conclusions regarding the current version of the package and future directions are discussed. # Preparing the data To analyze content homogeneity and news diffusion using computational text analysis, we need to know *who* said *what* at *what time*. Thus, our data consists of messages in text form, including meta information for the source and publication date of each message. The texts furthermore need to be pre-processed and represented as a *document-term matrix* (DTM). This section first discusses techniques and refers to existing software to pre-process texts and create the DTM, and reflects on how certain choices influence the analysis. Second, it shows how the source and date information should be organized. Third, it discusses several ways to filter and weight the DTM based on word statistics. ## Pre-processing texts and creating the DTM A DTM is a matrix in which rows represent documents, columns represent terms, and cells indicate how often each term occurred in each document. This is referred to as a *bag of words* representation of texts, since we ignore the order of words. This simplified representation makes analysis much easier and less computationally demanding, and as research has shown: ``a simple list of words, which we call unigrams, is often sufficient to convey the general meaning of a text'' [@grimmer13: 6]. As input for this package, the DTM can either be in the `dfm` class of the `quanteda` package [@quanteda] or in the `DocumentTermMarix` class of the `tm` package [@feinerer08]. We recommend using `quanteda`, because the `dfm` class can contain the document meta data (called `docvars` in `quanteda`), and is generally more efficient and versatile. For detailed instructions you can visit the [quanteda website](https://quanteda.io/). Here we will demonstrate one of the easiest ways to create a DTM with the meta data included, based on a data.frame. We first create a corpus with the `corpus` function, and then process this corpus into a DTM with the `dfm` function. ```{r, message=FALSE, warning=FALSE, echo=TRUE} library(quanteda) d = data.frame(id = c(1,2,3), text = c('Socrates is human', 'Humans are mortal', 'Therefore, Socrates is mortal'), author = c('Aristotle','Aristotle','Aristotle'), stringsAsFactors = F) corp = corpus(d, docid_field = 'id', text_field='text') dtm = dfm(tokens(corp)) dtm docvars(dtm) ``` Based on the representation of texts in the DTM, the similarity of documents can be calculated as the similarity of row vectors. However, as seen in the example, this approach has certain pitfalls if we simply look at all words in their literal form. For instance, the words "human" and "humans" are given separate columns, despite having largely the same meaning. As a result, this similarity of the first two texts is not recognized. Also, the words "are", and "is" do not have any substantial meaning, so they are not informative and can be misleading for the calculation of document similarity. There are various `preprocessing` techniques to filter and transform words that can be used to mend these issues. In addition, we can use these techniques to steer on what grounds documents are compared. * First of all, it is advisable to make all terms lowercase, and reduce terms to their root using stemming or lemmatization[^lemma]. Thus, "Hope", "hoped", "hoping", etc. all become "hope". This is because we are (often) interested in the meaning of these terms, and not the specific form in which they are used. * Second, one should filter out irrelevant words. Very common words, stopwords and boilerplate words contain little or no relevant information about news items. Very rare terms, while ignored in many computational text analysis approaches, can be particularly informative for comparing news items, and should probably be kept. * Third, it is also possible to specifically select or filter out only certain types of words by using part-of-speech tagging[^pos]. For instance, to compare whether documents address the same event one can focus on nouns and proper names. * Finally, an alternative approach is to combine words into N-grams (i.e. sets of N consecutive words). This way the comparison of documents focuses more on similarity in specific segments of text, which is useful if the goal is to trace whether sources literally copy each other (in which case most other pre-processing steps can be skipped). [^lemma]: Stemming and lemmatization are both techniques for reducing words to their root, or more specifically their stem and lemma. This is used to group different forms of the same word together. Without going into specifics, lemmatization is a much more computationally demanding approach, but generally gives better results. Especially for richly inflicted languages such as German or Dutch it is highly recommended to use lemmatization instead of stemming. [^pos]: Part-of-speech tagging is a technique that identifies types of words, such as verbs, nouns and adjectives. All mentioned preprocessing techniques except for lemmatization and part-of-speech tagging can be used in quanteda (as arguments in the `dfm` function). To use lemmatization or part-of-speech tagging there are several free to use grammar parsers, such as *CoreNLP* for English [@corenlp] and *Frog* for Dutch [@bosch07]. For more information about preprocessing, please consult [@welbers17]. For this vignette, data is used that has been preprocessed with the Dutch grammar parser Frog. The data is a sample of data used in a study on the influence of a Dutch news agency on the print and online editions of Dutch newspapers in political news coverage [@welbers18]. The terms have been lemmatized, and only the nouns and proper names of the headline and first five sentences (that generally contain the essential who, what and where of news) are used. By focusing on these elements, the analysis in this study focused on whether documents address the same events. This data is also made available in the package as demo data, in which the actual nouns and proper names have been substituted with indexed word types (e.g., the third organization in the vocabulary is referred to as organization.3, the 20th noun as noun.20). ```{r, message=FALSE, warning=FALSE, echo=TRUE} library(RNewsflow) dtm = rnewsflow_dfm ## copy the demo data dtm ``` To compare news across media and over time, it is necessary to have meta data about the date of a document, and convenient to have meta data about the source of a document. In the demo data, this is included as the `docvars` of the `dfm`. Note that if the `DocumentTermClass` class of the `tm` package is used, the meta data has to be managed manually, as a data.frame in which the rows (i.e. documents) match the rows (i.e. documents) in the DTM. ```{r} head(docvars(dtm), 3) ``` ## Using word statistics to filter and weight the DTM As a final step in the data preparation, we can filter and weight words based on word statistics, such as how often a word occured. Since we are analyzing news diffusion, a particularly interesting characteristic of words is the distribution of their use over time. To focus the comparison of documents on words that indicate new events, we can filter out words that are evenly used over time. To calculate this, we provide the `term_day_dist` function. ```{r} tdd = term_day_dist(dtm) tail(tdd, 3) ``` Of particular interest is the `days.entropy` score, which is the entropy of the distribution of words over days. This tells us whether the occurrence of a word over time is evenly distributed (high entropy) or concentrated (low entropy).[^autoboiler] The maximum value for entropy is the total number of days (in case of a uniform distribution). The `days.entropy.norm` score normalizes the entropy by dividing by the number of days. By selecting the terms with low entropy scores, the DTM can be filtered by using the selected terms as column values. [^autoboiler]: Note that this is also a good automatic approach for filtering out stopwords, boilerplate words, and word forms such as articles and common verbs. ```{r} select_terms = tdd$term[tdd$days.entropy.norm <= 0.1] dtm = dtm[,select_terms] ``` In addition to deleting terms, we should also weight terms. Turney explains that ''The idea of weighting is to give more weight to surprising events and less weight to expected events'', which is important because ''surprising events, if shared by two vectors, are more discriminative of the similarity between the vectors than less surprising events'' [@turney02: 156]. Thus, we want to give more weight to rare words than common words. A classic weighting scheme and recommended standard in information retrieval is the term-frequency inverse document frequency (tf.idf) [@sparck72,@monroe08]. This and other weighting schemes can easily be applied using the `quanteda` package, for instance using the `dfm_tfidf` function. ```{r} dtm = quanteda::dfm_tfidf(dtm) ``` # Calculating document similarities Given a DTM with document variables (see `?quanteda::docvars`) that indicate publication time, the document similarities over time can be calculated with the `newsflow_compare` function. The calculation of document similarities is performed using a vector space model [@salton75;@salton03] approach, but with a sliding window over time to only compare documents that occur within a given time distance. Any other columns (in our case "source" and "sourcetype") will automatically be included as document meta information in the output. Furthermore, three parameters are of particular importance. The `hour_window` parameter determines the time window in hours within which each document is compared to other documents. The argument is a vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(0, 36) will compare each document to all documents within the next 36 hours. The `measure` parameter, which determines what measure for similarity is used, defaults to *cosine* similarity[^alternativemeasure]. This is a commonly used measure, which indicates similarity as a score between 0 (no similarity) and 1 (identical)[^cosine_exception]. The `min_similarity` parameter is used to ignore all document pairs below a certain similarity score. In the current example we use a minimum similarity of 0.4, because a validity test in the paper from which the current data is taken found this to be a good threshold for finding documents that address the same events. Whether or not a threshold should be used and what the value should be depends on the goal of the analysis and the data. We recommend manual validation to verify that the similarity threshold matches human interpretation of whether or not two documents are similar, given a pre-defined interpretation of similarity (e.g., same event, same theme). [^alternativemeasure]: Alternatively, the current version also supports *overlap_pct*, which os a measure giving the percentage of overlapping term occurences. This is an asymmetrical measures: the overlap percentage of document 1 to document 2 can be different from the overlap percentage of document 2 to document 1. When using an asymmetrical measure, the direction should be carefully taken into account in the analysis. [^cosine_exception]: If the DTM contains negative values, the cosine similarity score can range from -1 to 1. Note that for the other similarity measures there can be no negative values in the DTM. ```{r results='hide', message=FALSE, warning=FALSE} g = newsflow_compare(dtm, date_var='date', hour_window = c(0,36), min_similarity = 0.4) ``` The output `g` is a network, or graph, in the format of the `igraph` package [@csardi06}. The vertices (or nodes) of this network represent documents, and the date and source of each document are stored as vertex attributes. The edges (or ties) represent the similarity of documents, and the similarity score and time difference are stored as edge attributes. To avoid confusion, keep in mind that from hereon when we talk about vertices or a vertex we are talking about documents, and that edges represent document pairs. An advantage of using a network format is that it combines this data in an efficient way, without copying the document meta information for each edge. This network forms the basis for all the analysis functions offered in this package[^external_sim]. [^external_sim]: If data about document similarities is imported, then the `document_network` function can be used to create this network. This way the functions of this package for aggregating and visualizing the network can still be used. A full understanding of the `igraph` package is not required to use the current package, but one does need some basic understanding of the functions for viewing and extracting the document/vertex and edge attributes. First, vertex and edge attributes cannot directly be extracted using the `$` symbol, but require the functions `V()` and `E()` to be used, for vertex and edge attributes, respectively. These can be used as follows. ```{r} vertex_sourcetype = V(g)$sourcetype edge_hourdiff = E(g)$hourdiff head(vertex_sourcetype) head(edge_hourdiff) ``` Alternatively, all vertex and edge attributes can be viewed or extracted with the `get.data.frame` function. ```{r, fig.show='hold'} v = as_data_frame(g, 'vertices') e = as_data_frame(g, 'edges') head(v[,c('name','date','source','sourcetype')],3) head(e,3) ``` The `weight` attribute of the edges represents the similarity score. The `hourdiff` attribute represents the time difference in hours between two documents, where the value indicates how long the `to` article was published after the `from` article. A histogram can provide a good first indication of this data. ```{r, fig.width = 7, fig.height = 3} hist(E(g)$hourdiff, main='Time distance of document pairs', xlab = 'Time difference in hours', breaks = 150, right=F) ``` In the histogram we see that most document pairs with a similarity score above the threshold are about an hour apart (1 on the x-axis). This is mainly because the online newspapers often follow the news agency within a very short amount of time. As time distance increases, the number of document pairs decreases, which makes sense because news gets old fast, so news diffusion should occur within a limited time after publication. ## Tailoring the document comparison window If news diffuses from one source to another, then the time difference cannot be zero, since the source that follows needs time to edit and publish the news. This delay period can also differ between sources. Websites can adopt news within minutes, but newspapers have a long time between pressing and publishing the newspaper, meaning that there is a period of several hours before publication during which influence is not possible. Thus, we have to adjust the window for document pairs. To make it more convenient to adjust and inspect the window settings for different sources, we offer the `filter_window` and `show_window` functions. The `filter_window` function can be used to filter the document pairs (i.e. edges) using the `hour_window` parameter, which works identical to the `hour_window` parameter in the `newsflow_compare` function. In addition, the `from_vertices` and `to_vertices` parameters can be used to select the vertices (i.e. documents) for which this filter is applied. This makes it easy to tailor the window for different sources, or source types. For the current data, we first set the minimum time distance for all document pairs to 0.1 hours. Then, we set a minimum time distance of 6 hours for document pairs where the `to` document is from a print newspaper. ```{r} # set window for all vertices g = filter_window(g, hour_window = c(0.1, 36)) # set window for print newspapers g = filter_window(g, hour_window = c(6, 36), to_vertices = V(g)$sourcetype == 'Print NP') ``` For all sources the window has now first been adjusted so that a document can only match a document that occured at least 0.1 hours later. For print newspapers, this is then set to 6 hours. With the `show_window` function we can view the actual window in the data. This function aggregates the edges for all combinations of attributes specified in `from_attribute` and `to_attribute`, and shows the minimum and maximum hour difference for each combination. We set the `to_attribute` parameter to "source", and leave the `from_attribute` parameter empty. This way, we get the window for edges *from* any document *towards* each source. ```{r} show_window(g, to_attribute = 'source') ``` We see here that edges towards "Print NP 2" have a time distance of at least 7.4 hours and at most 35.1 hours. This falls within the defined window of at least 6 and at most 36 hours. # Analyzing the document similarity network Before we aggregate the network, it can be informative to look at the individual sub-components. If a threshold for document similarity is used, then there should be multiple disconnected components of documents that are only similar to each other. With the current data, these components tend to reflect documents that address the same or related events. Decomposing the network can be done with the `decompose()` function from the `igraph` package. ```{r} g_subcomps = decompose(g) ``` The demo data with the current settings has `r length(g_subcomps)` sub-components. To visualize these components, we offer the `document_network_plot` function. This function draws a network where nodes (i.e. documents) are positioned based on their date (x-axis) and source (y-axis). ```{r, fig.width = 7, fig.height = 3} gs = g_subcomps[[20]] # select the second sub-component document_network_plot(gs) ``` The visualization shows that a news message was first published by the news agency on June 26th. Soon after this messages was covered in an online newspaper, and the next day it was also covered in a print newspaper. The grayscale of the edges indicates the level of similarity, showing that the online article was much more similar to the original newsagency article. By default, the "source" attribute is used for the y-axis, but this can be changed to other document attributes using the `source_attribute` parameters. If a DTM is also provided, the visualization will also include a word cloud with the most frequent words of these documents. ```{r, fig.width = 7, fig.height = 4} document_network_plot(gs, source_attribute = 'sourcetype', dtm=dtm) ``` These visualizations and the corresponding subcomponents help us to qualitatively investigate specific cases. This also helps to evaluate whether the document similarity measures are valid given the goal of the analysis. Furthermore, they illustrate how we can analyze homogeneity and news diffusion patterns. For each source we can count what proportion of its publications is similar to earlier publications by specific other sources. We can also analyze the average time between publications. Another useful application is that we can use them to see whether certain transformations of the network might be required. Depending on the purpose of the analysis it can be relevant to add or delete certain edges. For instance, in the previous visualizations we see that the print newspaper message matched both the newsagency message and two online newspaper messages. If we are specifically interested in who the original source of the message is, then it makes sense to only count the edge to the newsagency. Here we demonstrate the `only_first_match` function, which transforms the network so that a document only has an edge to the earliest dated document it matches within the specified time window[^duplicate]. [^duplicate]: If there are multiple earliest dated documents (that is, having the same publication date) then edges to all earliest dated documents are kept. ```{r, fig.width = 7, fig.height = 3} gs_onlyfirst = only_first_match(gs) document_network_plot(gs_onlyfirst) ``` ## Aggregating the document similarity network This package offers the `network_aggregate` function as a versatile way to aggregate the edges of the document similarity network based on the vertex attributes (i.e. the document meta information). The first argument is the network (in the `igraph` class). The second argument, for the `by` parameter, is a character vector to indicate one or more vertex attributes based on which the edges are aggregated. Optionally, the `by` characteristics can also be specified separately for `by_from` and `by_to`. This gives flexible control over the data, for instance to aggregate *from* sources *to* sourcetypes, or to aggregate scores for each source per month. By default, the function returns the number of edges, as well as the number of nodes that is connected for both the `from` and `to` group. The number of nodes that is connected is only relevant if a threshold for similarity (edge weight) is used, so that whether or not an edge exists indicates whether or not two documents are similar. In addition, if an `edge_attribute` is given, this attribute will be aggregated using the function specified in `agg_FUN`. For the following example we include this to analyze the median of the `hourdiff` attribute. ```{r} g_agg = network_aggregate(g, by='source', edge_attribute='hourdiff', agg_FUN=median) e = as_data_frame(g_agg, 'edges') head(e) ``` In the edges of the aggregated network there are six scores for each edge. The `edges` attribute counts the number of edges from the `from` group to the `to` group. For example, we see that *Newsagency* documents have 434 edges to later published *Online NP 1* documents. The `agg.hourdiff` attribute shows that the median of the hourdiff attribute of these 440 edges is 1.26 (1 hour and 16 minutes). In addition to the edges, we can look at the number of messages (i.e. vertices) in the `from` group that matched with at least one message in the `to` group. This is given by the `from.V` attribute, which shows here that 384 *Newsagency* documents matched with a later published *Online NP 1* document[^lowerthanedges]. This is also given as the proportion of all vertices/documents in the `from` group, as the `from.Vprop` attribute. Substantially, the `from.Vprop` score thus indicates that 64.32% of political news messages in *Newsagency* is similar or identical to later published *Online NP 1* messages. Alternatively, we can also look at the *to.V* and *to.Vprop* scores to focus on the number and proportion of messages in the `to` group that match with at least one message in the `from` group. Here we see, for instance, that 87.88% of political news messages in *Online NP 1* is similar to or identical to previously published messages *Newsagency* messages. The *from.Vprop* and *to.Vprop* scores thus give different and complementary measures for the influence of *Newsagency* on *Online NP 1*. [^lowerthanedges]: Note that the `edges` score is always equal to or higher than the `from.matched` score, since one document can match with multiple other documents. ## Inspecting and visualizing results The final step is to interpret the aggregated network data, or to prepare this data for use in a different analysis. We already saw that the network data can be transformed to a common data.frame with the `as_data_frame` function. Alternatively, `igraph` has the `as_adjacency_matrix` function to return the values for one edge attribute as an adjacency matrix, with the vertices in the rows and columns. This is a good way to present the scores of the aggregated network. ```{r} adj_m = as_adjacency_matrix(g_agg, attr= 'to.Vprop', sparse = FALSE) round(adj_m, 2) # round on 2 decimals ``` Alternatively, an intuitive way to present the aggregated network is by visualizing it. For this we can use the visualization features of the `igraph` package. To make this more convenient, we also offer the `directed_network_plot` function. This is a wrapper for the plot.igraph function, which makes it easier to use different edge weight attributes and to use an edge threshold, and has default plotting parameters to work well for directed graphs with edge labels. For this example we use the `to.Vprop` edge attribute with a threshold of 0.2. ```{r, fig.align='center', fig.width=7, fig.height=5} directed_network_plot(g_agg, weight_var = 'to.Vprop', weight_thres = 0.2) ``` For illustration, we can now see how the results change if we transform the network with the `only_first_match` function, which only counts edges to the first source that published a document. ```{r, fig.align='center', fig.width=7, fig.height=5} g2 = only_first_match(g) g2_agg = network_aggregate(g2, by='source', edge_attribute='hourdiff', agg_FUN=median) directed_network_plot(g2_agg, weight_var = 'to.Vprop', weight_thres = 0.2) ``` The first network is much more dense compared to the second. In particular, we see stronger edges between the print and online editions of the newspapers. In the second network only the edges from the news agency towards the print and online newspapers remain, meaning that edges between newspapers all score less than 0.2. This implies that many of the edges between newspapers that could be observed in the first network resulted from cases where both newspapers adopt the same news agency articles. Note that the second network is not better per se. It is possible that the initial source of a message is not the direct source. For example, a blog might not have access to the news agency feed, and therefore only receive news agency messages if they are published by another source. Thus, the most suitable approach depends on the purpose of the analysis. One of the goals of creating this package is to facilitate a platform for scholars and professionals to develop best practices for different occasions. ## Alternative applications of this package There are several alternative applications of the functions offered in this package that are not covered in this vignette. Here we briefly point out some of the more useful alternatives. In the network_aggregate function it is possible to use different vertex attributes to aggregate the edges for `from` and `to` nodes. A particularly interesting application of this feature is to use the publication date in the aggregation. For instance, with the following settings, we can get the proportion of matched documents per day. Here we also demonstrate the `return_df` parameter in the `network_aggregate` function. This way the results are returned as a single data.frame in which relevant edge and vertex information is merged. ```{r} V(g)$day = format(as.Date(V(g)$date), '%Y-%m-%d') agg_perday = network_aggregate(g, by_from='sourcetype', by_to=c('sourcetype', 'day'), edge_attribute='hourdiff', agg_FUN=median, return_df=TRUE) head(agg_perday[agg_perday$to.sourcetype == 'Online NP', c('from.sourcetype', 'to.sourcetype', 'to.day','to.Vprop')]) ``` Looking at the *to.Vprop* score, we see that on 2013-06-01, 78.95% of *Online NP* messages match with previously published *Newsagency* messages[^notfrom]. This way, the aggregated document similarity data can be analyzed as a time-series. For instance, to analyze whether seasonal effects or extra-media developments such as election campaigns affect content homogeneity or intermedia dynamics. The `return_df` feature is convenient for this purpose, because it directly matches all the vertex and edge attributes (as opposed to the `as_data_frame` function). [^notfrom]: The *from.Vprop* cannot be interpreted in the same way, because it gives the proportion of all messages in the `from` group. If one is interested in the *from.Vprop* score per day, the *by.from* and *by.to* arguments in *network_aggregate* need to be switched. Note that one should not simply aggregate both *by.from* and *by.to* by date, because then only documents that both occured on this date will be aggregated. Another useful application of this feature is to only aggregate the `by_to` group, by using the document name in the `by_from` argument. This way, the data in the *by_to* group is aggregated for each individual message. In the following example we use this to see for each individual message whether it matches with later published messages in each sourcetype. This could for instance be used to analyze whether certain document level variables (e.g., news factors, sensationalism) make a message more likely to be adopted by other news outlets. We set the `edge_attribute` to "weight" and `agg_FUN` to max, so that for each document we can see how strong the strongest match with each source was. ```{r} agg_perdoc = network_aggregate(g, by_from='name', by_to='sourcetype', edge_attribute='weight', agg_FUN=max, return_df=TRUE) docXsource = xtabs(agg.weight ~ from.name + to.sourcetype, agg_perdoc, sparse = FALSE) head(docXsource) ``` Finally, note that we have now only compared documents with future documents. We thereby focus the analysis on news diffusion. To focus on content homogeneity, each document can be compared to both past and future documents. By measuring content homogeneity aggregated over time, patterns such as longitudinal trends can be analyzed. # Conclusion and future improvements We have demonstrated how the *RNewsflow* package can be used to perform a many-to-many comparison of documents. The primary focus and most important feature of this package is the `newsflow_compare` function. This function compares all documents that occur within a given time distance, which makes it computationally feasible for longitudinal data. Using this data, we can analyze to what extent different sources publish the same content and whether there are consistent patterns in who follows whom. The secondary focus of this package is to provide functions to organize, visualize and analyze the document similarity data within R. By enabling both the document comparison and analysis to be performed within R, this type of analysis becomes more accesible to scholars less versed in computational text analysis, and it becomes easier to share and replicate methods at a very detailed level. The data input required for this analysis consists solely of textual documents and their corresponding publication date and source. Since no human coding is required, the package enables large scale comparative and longitudinal studies. Although the demonstration in this vignette used a moderate sized dataset, the `newsflow_compare` function can handle much larger data and is fast, thanks to a dedicated Rcpp (c++) implementation of sparse matrix multiplication for comparing documents within the given date window. The validity of the method presented here relies on various factors; most importantly the type of data, the pre-processing and DTM preparation steps and the similarity threshold. It is thus recommended to use a gold standard, based on human interpretation, to test its validity. In a recent empirical study (author citation, forthcoming) we obtained good performance for whether documents addressed the same events and whether they contained identical phrases by determining thresholds based on a manually coded gold standard. Also, by using a news website that consistently referred to a news agency by name, we confirmed that we could reliably identify news agency influence for this format. Naturally, there are always limitations to the accuracy with which influence in news diffusion can be measured using only content analysis. However, we generally do not have access to other reliable information. The advantage of a content analysis based approach is that it can be used on a large scale and over long periods of time, without relying on often equally inaccurate self-reports of news workers. The current package contributes to the toolkit for this type of analysis. Our goal is to continue developing this package as a specialized toolkit for analyzing the homogeneity and diffusion of news content. First of all, additional approaches for measuring whether documents are related will be added. Currently only a vector space model approach for calculating document similarity is implemented. For future versions alternative approaches such as language modeling will also be explored. In particular, we want to add measures to express the relation of documents over time in terms of probability and information gain. This would also allow us to define a more formal criterion for whether or not a relation exists, other than using a constant threshold for document similarity. Secondly, new methods for analyzing and visualizing the network data will be explored. In particular, methods will be implemented for analyzing patterns beyond dyadic ties between news outlets, building on techniques from the field of network analysis. To promote the involvement of other scholars and professionals in this development, the package is published entirely open-source. The source code is hosted on GitHub---https://github.com/kasperwelbers/RNewsflow. # Practical code example ```{r eval=FALSE} library(RNewsflow) library(quanteda) # Prepare DTM dtm = rnewsflow_dfm ## copy demo data tdd = term_day_dist(dtm) dtm = dtm[,tdd$term[tdd$days.entropy.norm <= 0.1]] dtm = dfm_tfidf(dtm) # Prepare document similarity network g = newsflow_compare(dtm, hour_window = c(-0.1,60), min_similarity = 0.4) g = filter_window(g, hour_window = c(6, 36), to_vertices = V(g)$sourcetype == 'Print NP') show_window(g, to_attribute = 'source') g_subcomps = decompose(g) document_network_plot(g_subcomps[[55]], dtm=dtm) g_agg = network_aggregate(g, by='source', edge_attribute='hourdiff', agg_FUN=median) as_adjacency_matrix(g_agg, attr='to.Vprop') directed_network_plot(g_agg, weight_var = 'to.Vprop', weight_thres=0.2) g2 = only_first_match(g) g2_agg = network_aggregate(g2, by='source', edge_attribute='hourdiff', agg_FUN=median) as_adjacency_matrix(g2_agg, attr='to.Vprop') directed_network_plot(g2_agg, weight_var = 'to.Vprop', weight_thres=0.2) ``` # References