1 Introduction
To understand complex systems, we need to know how their components (or entities) interact with each other. Networks (or graphs) offer a common language to model such systems: entities are represented by nodes and their interactions by edges Barabasi2016 . The networked point process is a stochastic model for such systems when data take the form of a time series of random events observed at each node. That is, at each node of a network we have a temporal point process, a random process whose realizations consist of the times of isolated events. Networked point processes are probabilistic models designed to analyze the influence that events occurring at one node may have on the events occurring at other nodes of the network.
Recently, several fields have used networked point processes to understand complex systems such as spiking biological neurons Monti2014 , social networks Blundell2012 ; Silva2009 , geosensor networks Hallac2017 , financial agents in markets Namaki2011 , television records Xu2016 and patient visits Choi2015 . One of the main objectives in these analyses is to uncover the causal relationships among the entities of the system, that is, the interaction structure among the nodes, also called the latent network structure. Typically, this structure is represented by a graph where edges connect nodes that affect each other and edge weights represent the intensity of this pairwise interaction. To the best of our knowledge, all methods that tackle this problem are based on the Hawkes point process Hawkes1971A ; Hawkes1971B , with a Granger causality framework Granger1969 imposed to retrieve the causal graph from data Achab2017 ; Xu2016 ; Zhou2013 ; Linderman2015 ; Linderman2014 . A point process $b$ is said to Granger cause another point process $a$ when the past information of $b$ can provide statistically significant information about the future occurrences of $a$. We can thus encode causal relationships as a matrix Dhamala2008 ; Didelez2008 . In a multivariate point process, this notion of causality reduces to measuring whether the conditional intensity function of $a$ is affected by the previous timestamps of $b$ Xu2016 .

In contrast with the popular choice of Hawkes processes to model interacting processes, Wold processes Wold1948
have been neglected as a possible model. Wold processes are a class of point processes defined in terms of the joint distribution of interevent times. Let $\delta_k$ be the waiting time for the $k$-th event since the occurrence of the $(k-1)$-th event. The main characteristic of Wold processes is that the collection of interevent times follows a Markov chain of finite memory. That is, unlike Hawkes processes, whose intensity function depends on the whole history of previous events, the probability distribution of the $k$-th interevent time depends only on the previous interevent time $\delta_{k-1}$. Since Wold processes are able to mimic the dynamics of complex systems VazdeMelo2013 ; VazdeMelo2015 , this Markovian property can potentially boost the performance of learning algorithms, as in our setting. Wold processes were proposed over sixty years ago; however, they remain scarcely explored in machine learning, which is unfortunate. As we demonstrate in this paper, Wold processes can be both fast and accurate for some learning tasks.
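To make the finite-memory property concrete, the following sketch (the particular rate function and parameter values are our own illustration, not the paper's) simulates a univariate Wold process in which each interevent time is exponentially distributed with a rate that depends only on the previous gap:

```python
import random

def simulate_wold(rate_of_prev, n_events, delta0=1.0, seed=42):
    """Simulate a univariate Wold process: each inter-event time is
    exponentially distributed with a rate depending only on the
    previous inter-event time (a finite-memory Markov chain on increments)."""
    rng = random.Random(seed)
    t, delta_prev, stamps = 0.0, delta0, []
    for _ in range(n_events):
        delta = rng.expovariate(rate_of_prev(delta_prev))
        t += delta
        stamps.append(t)
        delta_prev = delta
    return stamps

# Illustrative rate: a short previous gap raises the rate (bursts),
# a long one lowers it toward the baseline (idleness).
stamps = simulate_wold(lambda d: 0.1 + 1.0 / (1.0 + d), n_events=1000)
```

Because the next increment's distribution depends only on the last increment, simulation and likelihood evaluation need no scan over the whole history, which is the property GrangerBusca later exploits.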
We here present the first approach to tackle the discovery of latent network structures in point process data using Wold processes. Similar to recent efforts Achab2017 ; Xu2016 , our goal in this work is to extract Granger causality Granger1969 from multivariate point process data only. The main reason to consider the Wold process as an alternative to the Hawkes process is its Markovian structure. By adding Dirichlet priors over the mutual influences among the processes, we exploit the Markov property to develop learning algorithms that are asymptotically fast. We call our approach GrangerBusca. With $N$ being the number of observations in all processes and $K$ the number of processes, state of the art approaches learn at a per-iteration cost that is quadratic Xu2016 (with a multiplicative factor defined by hyperparameters) or cubic Achab2017 in $K$. GrangerBusca, in contrast, learns at a cost of $O(N(\log N + \log K))$ per iteration.
Equally important, our results show significant improvements over state of the art methods. For instance, when we measure the Precision@10 score between our Granger causal estimates and the ground-truth number of mentions in the commonly explored Memetracker data Leskovec2009 , our results are at least three times more accurate than the most recent, and previously most accurate, method Achab2017 .
In summary, our main contributions are: (1) We present the first approach to extract Granger causality matrices via Wold processes; (2) We develop asymptotically fast algorithms to learn such matrices; (3) We show how GrangerBusca is much faster and more accurate than state of the art methods, opening up the potential of Wold processes for the machine learning community.
2 Background and Related Work
A temporal point process is a probability model for a collection of times indexing the occurrences of random isolated events. Our context considers $K$ point processes observed simultaneously, where each event is associated with a single point process. Let $\mathcal{P}$ be the union of all timestamps from all point processes, with $N$ being the total number of events. We denote by $N_a(t)$ the random cumulative count of the number of events up to (and including) time $t$ in process $a$. The collection of counts $\{N_a(t)\}$ is an equivalent representation of the point process.
The conditional intensity rate function is the fundamental tool for modeling and making inferences on point processes. Let $\mathcal{H}_a(t)$ be the random history of the process up to, but not including, time $t$, called the filtration. Similarly, $\mathcal{H}(t)$ is defined as the filtration considering the collection of all point processes. In the limit, as $\delta \to 0$, the conditional intensity rate function is given by $\lambda_a(t \mid \mathcal{H}_a(t)) = \lim_{\delta \to 0} \mathbb{E}[N_a(t+\delta) - N_a(t) \mid \mathcal{H}_a(t)]/\delta$. The interpretation of this function is that, for a small time interval $\delta$, the value of $\lambda_a(t \mid \mathcal{H}_a(t))\,\delta$ is approximately equal to the expected number of events in $[t, t+\delta)$. It can also be interpreted as the probability that process $a$ has at least one event in the interval. The notation emphasizes that the conditional intensity at time $t$ depends on the random events that occurred previously.
The commonly explored Hawkes process is defined by its set of conditional intensity functions:
$\lambda_a(t \mid \mathcal{H}(t)) = \mu_a + \sum_{b=1}^{K} \sum_{t_{b,i} < t} \phi_{ba}(t - t_{b,i})$   (1)
We can consider the processes as being enumerated from $1$ to $K$. The constant $\mu_a \geq 0$ captures the exogenous (Poissonian) background arrival rate of process $a$. The kernel $\phi_{ba}$ captures the influence of the point process $b$ on $a$; it has to be integrable, nonnegative, and with $\phi_{ba}(t) = 0$ when $t < 0$. Usual forms of $\phi_{ba}$ are the exponential and power-law functions Bacry2015 . With $\phi_{ba}$ from Eq (1), there is evidence that $b$ Granger causes $a$ when there exists a time $t$ where $\phi_{ba}(t) > 0$ Xu2016 .
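For concreteness, a direct (if naive, $O(N)$ per query) evaluation of Eq (1) can be sketched as follows; the exponential kernel choice and all parameter values are illustrative assumptions, not taken from the paper:

```python
import math

def hawkes_intensity(t, history, mu, alpha, beta):
    """Evaluate Eq (1) with exponential kernels
    phi_ba(s) = alpha[b][a] * beta * exp(-beta * s).
    history[b] holds the timestamps of process b; note the whole
    history enters the sum (infinite memory)."""
    K = len(history)
    lam = list(mu)
    for a in range(K):
        for b in range(K):
            lam[a] += sum(alpha[b][a] * beta * math.exp(-beta * (t - s))
                          for s in history[b] if s < t)
    return lam

# Two processes; only process 0 excites process 1.
mu = [0.5, 0.1]
alpha = [[0.0, 0.8], [0.0, 0.0]]
history = [[1.0, 2.0], [2.5]]
lam = hawkes_intensity(3.0, history, mu, alpha, beta=1.0)
```

The scan over all past events is what the Markovian Wold alternative, discussed next, avoids.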
In contrast to the Hawkes process, a Wold process Daley2003 ; Wold1948 is defined in terms of a Markovian transition on the interevent times (increments). Let $\{\delta_{a,i}\}$ be the stochastic process of time increments associated with point process $a$, where $\delta_{a,i} = t_{a,i} - t_{a,i-1}$. The Markovian transition between increments is established by the transition probability density $f(\delta_{a,i+1} \mid \delta_{a,i})$, which measures the likelihood of an increment given the value of the previous increment Wold1948 ; Daley2003 .
It is important to point out that, for most forms of $f$, the model is analytically intractable Guttorp2012 . However, when $f$ has an exponential form, the model has several interesting properties Cox1955 ; Daley1892 . In this particular form, the next increment is exponentially distributed with a rate $\lambda(\delta)$ that depends on the previous increment $\delta$. For a particular case of this rate, the work of Cox Cox1955 and Daley Daley1892 ; Daley2003 derives the stationary distribution of increments. GrangerBusca is motivated by recent efforts Alves2016 ; VazdeMelo2015 ; VazdeMelo2013 that employ variations of the exponential Wold process (defined above). Instead of defining the Wold process in terms of its interevent exponential rate, such efforts define the process in terms of the conditional mean
of an exponentially distributed random variable. This class of point processes is called
self-feeding processes (SFP). For a particular choice of parameters (involving the Euler constant), VazdeMelo2015 ; VazdeMelo2013 showed that the stationary behavior can be very well approximated by the more tractable Log-Logistic distribution. This way of specifying a Wold process is interesting because it is able to capture both exponential and power-law behavior, disparate behaviors simultaneously observed in real data Alves2016 ; VazdeMelo2015 ; VazdeMelo2013 . Realizations of this process tend to generate bursts of intense activity followed by long periods of silence. Busca Alves2016 is another point process model based on Wold processes and is GrangerBusca's starting point. Starting from an SFP model, to accommodate the less frequent long waiting times observed in some real datasets, the process adds a background Poissonian rate $\mu_a$. The conditional intensity can thus be derived as:
$\lambda_a(t \mid \mathcal{H}_a(t)) = \mu_a + \dfrac{\alpha}{\beta + \delta_a(t)}$   (2)
where $\delta_a(t) = t_{a,\ell(t)} - t_{a,\ell(t)-1}$, with $\ell(t)$ being the index of the event in process $a$ that is closest to, and earlier than, $t$. That is, $\delta_a(t)$ is the previously observed increment.
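A minimal sketch of Eq (2) under the symbols assumed above ($\mu$, $\alpha$, $\beta$ and the previous-increment definition of $\delta_a(t)$ are reconstructed from context, so treat the exact form as an assumption):

```python
import bisect

def busca_intensity(t, events, mu, alpha, beta=1.0):
    """Busca-style conditional intensity, Eq (2) shape:
    lambda(t) = mu + alpha / (beta + delta(t)), where delta(t) is the
    previously observed increment of the (sorted) event list."""
    i = bisect.bisect_left(events, t)   # number of events strictly before t
    if i < 2:
        return mu                       # no full increment observed yet
    delta = events[i - 1] - events[i - 2]
    return mu + alpha / (beta + delta)

events = [1.0, 1.2, 4.0]
```

Only the last two timestamps matter, so a query is a single binary search rather than a pass over the history.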
Over the years, several Hawkes-based methods have been created for different applications, with varying asymptotic costs Bao2017 ; Blundell2012 ; Chen2017 ; Chevallier2015 ; Li2017 ; Rizoiu2017 . We now discuss the methods most related to GrangerBusca. Achab et al. Achab2017 present an approach whose per-iteration cost is cubic in the number of processes. Xu et al. Xu2016 learn kernels at a per-iteration cost quadratic in the number of processes, multiplied by the number of basis functions used to approximate the kernel nonparametrically. Similarly, Yang et al. Yang2017 discuss a nonparametric Hawkes model that does not explore the infinite memory. The authors show that after sufficiently many events the kernel is adequately estimated; however, the proposed method still scales with both the predefined number of basis functions used to approximate the kernel and the maximum number of previous events to be considered.¹ A similar proposal is presented by Etesami et al. Etesami2016 . In contrast, GrangerBusca has a computational complexity of $O(N(\log N + \log K))$ per iteration. We achieve this low cost by employing Bayesian inference where, at each step, the algorithm learns, for each event, which other process, if any, caused that event. Past efforts have employed similar sampling approaches on Hawkes-based models Du2015 ; Linderman2014 ; Linderman2015 , which again do not scale. It is important to point out that Isham Isham1977 was one of the first and few to discuss multivariate Wold processes. However, the author was mostly focused on analytical properties of the multivariate Wold process in a very specific setting (an infinite process defined on the unit circle).

¹The authors discuss a lower cost for a parallel algorithm. While this is correct asymptotically, we compare, for all methods, the non-parallel cost with hyperparameters, since in practice hyperparameters have a multiplicative effect on learning time. Moreover, for every parallel method (GrangerBusca included), such a reduction is only achievable via unbounded parallelization (one process per CPU core), which is unfeasible for large data.
3 Model
We define GrangerBusca using Figure 1, which exemplifies the behavior of the model with two processes. In (a), we show the events of each process as a horizontal line of symbols: triangles in the upper row for one process, and circles in the bottom row for the other. The unfilled symbols represent events that are caused in an exogenous way, not triggered by any other event. We shall detail how to label events in Section 4. The filled symbols are events that appear as a causal influence from some previous event; we say that these events have been excited by the previous occurrence of other events. A directed edge connects the origin event to the resulting event. Edges crossing the two parallel lines of events represent the cross-excitement component: one event in a given process stimulates the occurrence of events in another process. Another situation is self-excitement, when events in a given process stimulate further events in the same process; these are represented by the horizontal arrows in Figure 1(a). Figure 1(c) shows the behavior of the random conditional intensity function. Notice that it behaves like a step function: the rate of arrivals is fixed until the next arrival. The figure also shows the burst-idleness cycle observed with GrangerBusca, as well as the cross-excitation.
To formalize GrangerBusca, let us first rewrite $\delta_a(t)$ in function notation. Recall that $\ell(t)$ is the index of the previously observed event in process $a$ that is closest to $t$, and that $\delta_a(t)$ is the last observed increment. The function notation will simplify our extension to a multivariate Wold process. See Figure 1(b) for an illustration. Let us now define $\ell_b(t)$ as the index of the event in process $b$ that is closest to, and earlier than, $t$. With such an index, we can denote by $\Delta_{ba}(t)$ the difference between $t$ and $t_{b,\ell_b(t)}$, i.e., $\Delta_{ba}(t) = t - t_{b,\ell_b(t)}$. When the expected values of $\Delta_{ba}(t)$ are small, events in $b$ usually closely precede events in $a$. In this sense, one intuition on how GrangerBusca works is that larger observed values of $\Delta_{ba}(t)$ lead to weaker evidence for the influence of process $b$ on process $a$.
The above behavior motivates GrangerBusca’s multivariate conditional intensity function:
$\lambda_a(t \mid \mathcal{H}(t)) = \mu_a + \sum_{b=1}^{K} \dfrac{\alpha_{ba}}{\beta + \Delta_{ba}(t)}$   (3)
The random intensity function is the sum of two components. The first one is $\mu_a$, representing the exogenous events in process $a$ that are generated at a constant Poissonian rate per unit of time. The other component is a sum over all processes, including process $a$ itself. The terms in this sum represent the increment on the baseline rate contributed by previous events from $a$ itself (self-excitement) or from other processes (cross-excitement). Based on a very sparse representation, the entire history is concentrated only on the most recent time gaps between events. The influence of process $b$ on process $a$ is represented by the ratio $\alpha_{ba}/(\beta + \Delta_{ba}(t))$. The numerator is a scale factor measuring the amount of cross-excitement: when $\alpha_{ba}$ is equal to zero, there is no influence from $b$ on $a$. The denominator models how this cross-excitement takes place. At time $t$, we add the time gap $\Delta_{ba}(t)$ to the flat value $\beta$. A large gap makes the contribution of the ratio small relative to the baseline rate $\mu_a$. Otherwise, a small gap raises this contribution up to its maximum possible rate of $\alpha_{ba}/\beta$.
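The intensity just described can be sketched directly; the symbols ($\mu$, $\alpha_{ba}$, $\beta$, and the gap-to-last-event definition of $\Delta_{ba}(t)$) follow the reconstruction above and should be read as assumptions:

```python
import bisect

def granger_busca_intensity(t, a, events, mu, A, beta=1.0):
    """Eq (3) shape: lambda_a(t) = mu[a] + sum_b A[b][a] / (beta + gap),
    where gap is the time since the closest event of process b before t.
    events[b] is the sorted timestamp list of process b."""
    lam = mu[a]
    for b, ev in enumerate(events):
        i = bisect.bisect_left(ev, t)
        if i > 0:                          # b has at least one event before t
            lam += A[b][a] / (beta + (t - ev[i - 1]))
    return lam

mu = [0.05, 0.05]
A = [[0.5, 0.5], [0.0, 1.0]]               # row-stochastic influence matrix
events = [[1.0, 3.0], [2.0]]
```

Each query touches only the most recent event per process, found by binary search, which is the source of the low learning cost discussed in Section 4.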
Model parameters and definitions: The full set of model parameters comprises the Granger matrix $\mathbf{A} = [\alpha_{ba}]$, the exogenous rates $\mu_a$ and the rates $\beta_a$. We require that $\alpha_{ba} \geq 0$ and that each row of $\mathbf{A}$ sums to one. Hence, the value in cell $(b, a)$ is proportional to the fraction of events from process $b$ that triggered events on $a$. By definition, $\mathbf{A}$ is row stochastic Horn1990 . This property leads to several interesting characteristics of the model, which we develop throughout the rest of this section. The vector $\boldsymbol{\beta}$ captures the overall influence strength for each process: when a process influences another, it does so through exponentially distributed interevent times whose mean is governed by $\beta_a$. As we show next, to guarantee stationary conditions, it is necessary that $\beta_a \geq 1$. Finally, we consider that each event has a latent label indicating either that it is exogenous or which process caused it (edges in Figure 1(a)). Estimation of $\mathbf{A}$ would be trivial if these labels were known; thus, our learning algorithm (see Section 4) focuses on estimating such labels from data. As pointed out, GrangerBusca's stationarity (some authors also call this property stability Daley2003 ; Linderman2014 ) depends on $\boldsymbol{\beta}$. Let $\mathbf{B}$ be the diagonal matrix where each diagonal cell equals $\beta_a^{-1}$. Moreover, let $\mathbf{M} = \mathbf{B}\mathbf{A}$. Each value in $\mathbf{M}$ is the maximum cross-intensity for a pair of processes. If influencing processes are represented in the rows and influenced processes in the columns, the expected number of events in an infinitesimal region for a process is bounded by the corresponding row sum of this matrix. This value is simply GrangerBusca's cross-feeding intensity without the exogenous factor. Now, let $\|\cdot\|_\infty$ be the induced norm (maximum row sum).
Definition 1 (Stationarity): Notice that $\|\mathbf{M}\|_\infty \leq 1$ Horn1990 . The proof of this property is straightforward and has interesting implications. As $\mathbf{A}$ is row stochastic, we have $\|\mathbf{A}\|_\infty = 1$. Also, since $\beta_a \geq 1$, the diagonal matrix $\mathbf{B}$ will either scale down this norm or keep it unchanged. Finally, the spectral radius is bounded by the induced norm: $\rho(\mathbf{M}) \leq \|\mathbf{M}\|_\infty \leq 1$.

Due to the above definitions, the model is stationary/stable, as it will never generate infinite offspring. In fact, the expected number of offspring at any given time is bounded by the row sums of $\mathbf{M}$. Next, we explore the definition of Granger causality for point processes Didelez2008 ; Xu2016 .
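The norm argument can be checked numerically. The sketch below (matrix size and $\beta$ values are arbitrary illustrations) verifies that for a row-stochastic matrix scaled by $\mathrm{diag}(1/\beta_a)$ with $\beta_a \geq 1$, the spectral radius never exceeds the maximum row sum, which never exceeds one:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.random((4, 4))
G /= G.sum(axis=1, keepdims=True)        # row stochastic: each row sums to 1
beta = np.array([1.0, 2.0, 1.5, 3.0])    # all >= 1, as stationarity requires
M = np.diag(1.0 / beta) @ G              # scaled cross-intensity matrix

inf_norm = np.abs(M).sum(axis=1).max()   # induced infinity-norm (max row sum)
rho = np.abs(np.linalg.eigvals(M)).max() # spectral radius
assert rho <= inf_norm + 1e-9
assert inf_norm <= 1.0 + 1e-9
```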
Definition 2 (Granger Causality): A process $a$ is said to be independent of any other process when $\lambda_a(t \mid \mathcal{H}(t)) = \lambda_a(t \mid \mathcal{H}_a(t))$ for any $t$. In contrast, $b$ is defined to Granger cause $a$ when the past events of $b$ change this conditional intensity. As a consequence, given two processes $a$ and $b$ whose intensity functions follow Eq (3), Granger causality arises when $\alpha_{ba} > 0$.²

²When interpreting results from GrangerBusca, we relax the above condition of strict independence to $\alpha_{ba} \approx 0$. In such cases, we state that we have no statistical evidence for Granger causality.
4 Learning GrangerBusca
Recall from Figure 1 that GrangerBusca exhibits cluster-like behavior. That is, exogenous observations arrive at a fixed rate, with each observation being able to trigger new observations, leading to a burst/idleness cycle. Based on this remark, we developed a Markov Chain Monte Carlo (MCMC) sampling algorithm to learn GrangerBusca from data. Our algorithm works by sampling, for each observation, the hidden process (or label) that likely caused it. In other words, we sample a latent variable that takes the value $b$ when process $b$ influenced the observation. When the stamp is exogenous, we set this value to a constant; such a number merely represents a label (exogenous) and does not affect our sampling.
To simplify our learning strategy, we decided to fix $\beta = 1$. We note that such a choice was shown to be sufficient for GrangerBusca to outperform state of the art baselines (see Section 5). Later in this section, we discuss how our learning algorithm is easily adaptable to general forms of the intensity function. $\alpha_{ba}$ is estimated based on the number of events from $b$ that caused events on $a$, whereas $\mu_a$ is estimated as the maximum likelihood rate of a Poisson process based on the exogenous events of $a$. We can thus learn GrangerBusca with an Expectation Maximization approach. Hidden labels are estimated in the Expectation step; the matrix $\mathbf{A}$ can also be readily updated in such a step. With the labels fixed, the remaining parameters are estimated in the Maximization step. Next, we discuss our fitting strategy. Initially, we exploit the Markovian nature of the process to evaluate $\Delta_{ba}(t)$ at an $O(\log N)$ cost: given some label assignments for the events, we obtain $\Delta_{ba}(t)$ with a binary search over the timestamps of $b$. We now explain how we update the labels. Given any event at time $t$ in process $a$, the event is either exogenous or induced by some previous event on $a$ or on some other process $b$. By the superposition theorem Daley2003 ; Linderman2014 ; Linderman2015 ; Du2015 , we obtain the conditional probability that an individual event at $t$ is exogenous or was caused by process $b$ (where $b$ can be equal to $a$ for SFP-like behavior) as:
$P[t \curvearrowleft b \mid \mathcal{H}(t)] = \dfrac{\alpha_{ba}/(\beta + \Delta_{ba}(t))}{\lambda_a(t \mid \mathcal{H}(t))}, \qquad P[t \text{ exogenous} \mid \mathcal{H}(t)] = \dfrac{\mu_a}{\lambda_a(t \mid \mathcal{H}(t))}$   (4)

where the operator $\curvearrowleft$ indicates causality ($t \curvearrowleft b$ reads: the event at $t$ was caused by process $b$). Eq (4) is carried out conditionally on the history $\mathcal{H}(t)$.
We can substantially accelerate our evaluations by using a modified binary Fenwick tree Fenwick1994 (the F+Tree Yu2015 ) data structure if we break the label assignment into two steps. First, we decide whether the event is exogenous. Given that it is not, we select the inducing process based on the conditional probability:
$P[t \curvearrowleft b \mid \mathcal{H}(t),\, t \text{ not exogenous}] \propto \dfrac{\alpha_{ba}}{\beta + \Delta_{ba}(t)}$   (5)

The evaluation of the probability that an event is not exogenous has an $O(1)$ cost because we estimate it from $t - t_e$, where $t_e$ is the last event before $t$ that arrived from an exogenous factor. This probability is the complement of the probability that zero Poissonian events happened between $t_e$ and $t$. As $\mathbf{A}$ is row stochastic, we first add a Dirichlet prior over each of its rows. We finally sample the hidden labels as follows:
For each process $a$:
    Sample row $\mathbf{A}[a]$ from its Dirichlet prior
For each process $a$:
    For each observation $t$ in process $a$:
        Sample whether $t$ is exogenous
        When exogenous: assign the constant exogenous label
        Otherwise: sample the inducing process $b$ according to Eq (5)
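A simplified, self-contained sketch of the label-assignment step follows. It is our own simplification: it samples directly from the unnormalized per-process weights of Eq (4)/(5) rather than using the collapsed counts or the F+Tree, and all parameter values are illustrative:

```python
import bisect
import random

def sample_label(t, a, events, mu, A, beta=1.0, rng=None):
    """Sample which process (or the exogenous background, returned as -1)
    caused the event at time t in process a, proportionally to each
    candidate's contribution to the intensity."""
    rng = rng or random.Random(0)
    weights = [mu[a]]                          # slot 0: exogenous label
    for b, ev in enumerate(events):
        i = bisect.bisect_left(ev, t)          # events of b strictly before t
        if i == 0:
            weights.append(0.0)                # b cannot have caused t yet
        else:
            weights.append(A[b][a] / (beta + (t - ev[i - 1])))
    total = sum(weights)
    if total == 0.0:
        return -1
    u, acc = rng.random() * total, 0.0
    for label, w in enumerate(weights):
        acc += w
        if u < acc:
            return label - 1                   # -1 = exogenous, else process id
    return len(weights) - 2
```

With $\mu_a = 0$ and a single candidate process carrying positive weight, the sampler deterministically attributes the event to that process.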
Model parameters are estimated through an MCMC sampler. Starting at an arbitrary random state (i.e., label assignment), let $n_{ba}$ be the number of times $b$ influenced $a$. Similarly, $n_b$ captures the number of times $b$ influenced any process, including itself. The conditional probability of the hidden labels for every observation is proportional to these counts combined with the Wold factors of Eq (5). By factoring out $\mathbf{A}$, labels are sampled with a collapsed Gibbs sampler Steyvers2007 . Thus, we rewrite Eq (5) below; in this next equation, we point out the Proposal and Target distributions used in the Metropolis-based sampler (discussed next):

$P[t \curvearrowleft b \mid \ldots] \propto \hat{\alpha}_{ba} \, \dfrac{1}{\beta + \Delta_{ba}(t)}$   (6)

where $\hat{\alpha}_{ba}$ is the current estimate of $\alpha_{ba}$ from the counts, with $n^{-}_{ba}$ being the count for the pair $(b, a)$ excluding the current assignment for $t$, and $n^{-}_{b}$ being similarly defined. The count factor serves as the Proposal distribution; the full product is the Target. Thus, the full algorithm follows an EM approach. After labels are assigned in the Expectation step, we compute $\mu_a$ for every process by simply estimating the maximum likelihood Poissonian rate. Given that it takes $O(\log N)$ time to compute $\Delta_{ba}(t)$ and $O(K)$ time to compute Eq (6) for all candidates, the total sampling complexity per learning iteration of the Gibbs sampler is $O(N(\log N + K))$.
Speeding Up with a Metropolis-Based Sampler: The $O(K)$ factor in the traditional Gibbs sampler may be reduced by exploiting specific data structures, such as the Alias Table Yuan2015 ; Li2014 or the F+Tree Yu2015 . In order to speed up GrangerBusca, we employ the latter (F+Tree). The trade-offs between the two are discussed in Yu2015 . Our choice is motivated by the fact that the F+Tree does not make use of stale samples. Thus, we can perform multinomial sampling at an $O(\log K)$ cost. Given that the Alias Table cannot be updated quickly, it is usually only suitable at later iterations Yuan2015 ; Li2014 .
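To illustrate the asymptotics, here is a plain Fenwick (binary indexed) tree supporting $O(\log K)$ weight updates and $O(\log K)$ multinomial draws; the actual F+Tree of Yu2015 differs in layout, so this is an analogy, not the paper's structure:

```python
class FenwickSampler:
    """Fenwick tree over K unnormalized weights: O(log K) updates and
    O(log K) multinomial draws, matching the asymptotics the F+Tree
    provides to the sampler."""
    def __init__(self, k):
        self.k = k
        self.tree = [0.0] * (k + 1)

    def update(self, i, delta):
        """Add delta to the weight of index i (0-based)."""
        i += 1
        while i <= self.k:
            self.tree[i] += delta
            i += i & (-i)

    def sample(self, u):
        """Return the index whose cumulative-weight bucket contains u,
        for u in [0, total weight)."""
        idx, step = 0, 1
        while step * 2 <= self.k:
            step *= 2
        while step:
            nxt = idx + step
            if nxt <= self.k and self.tree[nxt] <= u:
                u -= self.tree[nxt]
                idx = nxt
            step //= 2
        return idx

s = FenwickSampler(3)
for i, w in enumerate([0.5, 0.25, 0.25]):
    s.update(i, w)
```

Because updates are as cheap as draws, the per-event weight changes produced by reassigned labels can be folded in immediately, avoiding the stale samples of Alias-Table approaches.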
We exploit the F+Tree by changing our sampler to a Metropolis-Hastings (MH) approach. Using the common notation for MH, let $q$ be the proposal probability density function as detailed in Eq (6). Here, a candidate process may replace the current assignment. When the target distribution function is simply Eq (6), we can sample the assignment using the standard acceptance probability: at each iteration we either keep the previous assignment or replace it with the candidate according to this probability. As discussed, with the F+Tree, we can sample from $q$ in $O(\log K)$ time. We can also update the tree with the new probabilities after each step at the same cost. Given that the F+Tree has an $O(K)$ cost to build, we build the tree once per process. Finally, as required for the Metropolis-Hastings algorithm, it is trivial to see that the proposal distribution is proportional to the target distribution Hastings1970 .

Parallel Sampler: With the F+Tree, candidates are sampled at an $O(\log K)$ cost per event. Moreover, finding the previous stamp costs $O(\log N)$. By adding these two costs, the algorithmic complexity of GrangerBusca per iteration is $O(N(\log N + \log K))$. Finally, notice that the sampler is easy to parallelize. That is, by iterating over events on a per-process basis, we parallelize the algorithm by scheduling different processes to different CPU cores. Overall, only one vector of variables is shared across processes (the counts in Eq (6)). In our implementation, each core reads this vector for each process before an iteration; after the iteration, the values are updated globally. Our sampler falls in the Asynchronous with Shared Memory case discussed in Terenin2017 .
Learning different Kernels: Eq (3) can be rewritten so that the model acts as a mixture of pairwise intensities, each weighted by the causal parameters $\alpha_{ba}$. Notice that, using this form, our EM algorithm is easily extensible to different intensity functions: the E-step estimates the causal graph, while the M-step provides maximum likelihood estimates for the specific parameters of the chosen intensity. In fact, even unsupervised forms may be learned. As we discuss in the next section, we keep the aforementioned intensity given that it is simpler and outperforms the baselines on our datasets.
5 Results and Experiments
We now present our main results. GrangerBusca is compared with three recently proposed baseline methods: ADM4 Zhou2013 , HawkesGranger Xu2016 and HawkesCumulants Achab2017 . The code for each method is publicly available via the library tick⁴. Experiments were performed on a dedicated Azure Standard DS15 v2 instance with 20 Intel Xeon CPU E5-2673 v3 cores and 140GB of memory. We point out that comparisons are performed with Hawkes methods given that it is the most prominent framework; there is no Wold method for our task. We compare with approaches that are parametric Zhou2013 and nonparametric Xu2016 ; Achab2017 , and that explore finite Achab2017 and infinite Zhou2013 ; Xu2016 memory.

⁴https://github.com/XDataInitiative/tick. Results produced with version 0.4.0.0.
Hyper Parameters: ADM4 enforces an exponential kernel with a fixed rate. We employ a Tree-structured Parzen Estimator Bergstra2011 to learn such a rate, optimizing for the best model in terms of log-likelihood. For HawkesGranger, we fit the model with the number of basis functions suggested in Achab2017 . Finally, HawkesCumulants Achab2017 also has a single hyperparameter, called the half-width, which was also estimated using Bergstra2011 . Training is performed until convergence or until 300 iterations are reached. Our MCMC sampler executes for 300 iterations.
Datasets: We evaluate GrangerBusca and the three aforementioned baselines on 9 different datasets, all of which were gathered from the Snap Network Repository⁵. Out of the nine datasets, we pay particular attention to the Memetracker data, which is the only one used to evaluate all past efforts. The Memetracker dataset consists of web domains linking to one another. We preprocess the Memetracker dataset using the code made available by Achab2017 . The other eight datasets consist of source nodes (e.g., students or blogs) sending events to destination nodes (e.g., messages or citations). Each dataset can be represented as triples (source, destination, timestamp). The ground-truth network is defined as the graph whose vertices are both the sources and the destinations, and whose edges represent the relationship between two entities. The weighted adjacency matrix of this graph is our ground-truth: the weight of an edge is the number of events between the corresponding pair of nodes. To compose the processes from these datasets, each destination node represents a process. Notice that we do not consider source nodes as processes. Thus, the models aim to extract causal graphs based on the likelihood that the reception of messages at a destination will trigger incoming messages at other destinations. If this is the case, we have evidence that the destination is also a source node. Finally, we preprocess the data to only consider destinations that have also sent messages.

⁵https://snap.stanford.edu/data/
| Dataset | # Proc (K) | # Obs. (N) | N (Top-100) | Span | %NZ | P@5 | P@10 | P@20 | TT(s) |
|---|---|---|---|---|---|---|---|---|---|
| bitcoin-alpha Kumar2016 | 3,257 | 23,399 | 2,279 | 5Y | 0.2% | 0.26 | 0.14 | 0.07 | 3 |
| bitcoin-otc Kumar2016 | 4,791 | 33,766 | 2,328 | 5Y | 0.1% | 0.25 | 0.14 | 0.07 | 7 |
| college-msg Panzarasa2009 | 1,313 | 58,486 | 10,869 | 193D | 1.1% | 0.36 | 0.30 | 0.19 | 1 |
| email Leskovec2007 ; Yin2017 | 803 | 327,677 | 92,924 | 803D | 3.74% | 0.23 | 0.28 | 0.32 | 4 |
| sx-askubuntu Paranjape2017 | 88,549 | 879,121 | 58,142 | 7Y | 0.006% | 0.25 | 0.13 | 0.06 | 2774 |
| sx-mathoverflow Paranjape2017 | 16,936 | 488,984 | 59,602 | 7Y | 0.07% | 0.28 | 0.16 | 0.09 | 98 |
| sx-superuser Paranjape2017 | 114,623 | 1,360,974 | 64,866 | 7Y | 0.006% | 0.26 | 0.14 | 0.07 | 4614 |
| wiki-talk Leskovec2010b ; Paranjape2017 | 251,154 | 7,833,140 | 211,344 | 6Y | 0.003% | 0.25 | 0.14 | 0.07 | 27540 |
| memetracker-100 Leskovec2009 | 100 | 24,665,418 | — | 9M | 9.85% | 0.30 | 0.29 | 0.22 | 114 |
| memetracker-500 Leskovec2009 | 500 | 39,318,989 | — | 9M | 4.44% | 0.30 | 0.30 | 0.23 | 274 |
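The ground-truth construction described in the Datasets paragraph can be sketched as follows; the triples below are hypothetical stand-ins for one of the datasets:

```python
from collections import Counter

# Hypothetical (source, destination, timestamp) triples. The ground-truth
# edge weight between s and d is the number of events s sent to d.
triples = [("s1", "d1", 1.0), ("s1", "d1", 2.0),
           ("s2", "d1", 3.0), ("s1", "d2", 4.0)]
weights = Counter((s, d) for s, d, _ in triples)
```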
Metrics: We evaluate GrangerBusca and its competitors using the Precision@n, Kendall correlation and relative error scores. Each score is measured per node (or row of the Granger matrix), and is summarized for the entire dataset as the average over every node. Both the Kendall correlation and the relative errors were proposed as evaluation metrics for networked point processes in Achab2017 . Precision@n captures the average fraction of the top $n$ edges by estimated weight that are also present in the ground truth. As we shall discuss, there are several limitations with the Kendall and relative error scores due to graph sparsity. We argue that Precision@n measured at different cutoffs ($n$) is a more reliable evaluation metric for large and sparse graphs such as the ones we explore here.

Results: Table 1 describes the datasets used in this work, presenting their number of nodes and of observations. Most baselines do not execute on the large datasets in under 24h of training time (TT), in contrast with GrangerBusca. Given the asymptotic complexity, we estimate that some models may take months to execute. Hence, to allow comparisons with GrangerBusca, we considered subsamples containing only the events involving the Top-100 destinations. We pay particular attention to the Top-100 and Top-500 nodes for Memetracker, given that they were explored in prior efforts Zhou2013 ; Xu2016 ; Achab2017 .
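The Precision@n metric described above can be sketched directly (array contents are illustrative):

```python
import numpy as np

def precision_at_n(estimate, truth, n):
    """Average over rows of the fraction of the top-n estimated edges
    (by weight) that exist (are nonzero) in the ground truth."""
    hits = [np.mean(truth_row[np.argsort(-est_row)[:n]] > 0)
            for est_row, truth_row in zip(estimate, truth)]
    return float(np.mean(hits))

est = np.array([[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]])
truth = np.array([[1, 0, 0], [0, 3, 2]])
```

Unlike rank correlations, this score only asks whether the strongest estimated edges are real, which is why it degrades gracefully on very sparse ground truths.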
Table 1 also presents the Precision@n scores for GrangerBusca. Recall that ours is the only approach able to execute on the full sets of data. Firstly, notice that the Kendall and relative error scores are absent from Table 1. Given that the datasets are sparse – as shown by the fraction of nonzeros in the ground truth (%NZ) in Table 1 – the Kendall correlations and relative errors are unreliable metrics for large networks. It is well known that complex networks such as ours have long-tailed distributions of edge weights Barabasi2016 , leading to the high sparsity levels (%NZ) seen in Table 1. With most cells being zeros, Kendall scores tend to zero, as most sources connect to few destinations. Similarly, the relative errors will likely increase. In order to avoid divisions by zero, previous efforts impose a constant penalization, of one, when zero edges exist between two nodes. Similar to the Kendall correlation, this penalization will dominate the score due to the sparsity.
| Dataset | P@5 (HC) | P@5 (GB) | P@10 (HC) | P@10 (GB) | P@20 (HC) | P@20 (GB) | Kendall (HC) | Kendall (GB) | Rel. Error (HC) | Rel. Error (GB) | TT(s) (HC) | TT(s) (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| top-100 | 0.06 | 0.30 | 0.09 | 0.29 | 0.01 | 0.22 | 0.05 | 0.26 | 1.0 | 0.44 | 87 | 114 |
| top-500 | 0.01 | 0.30 | 0.01 | 0.30 | 0.02 | 0.23 | 0.08 | 0.20 | 1.8 | 0.06 | 715 | 274 |
Because of the above limitations, we are unable to present a fair comparison with competitors on the full datasets. To achieve this goal, in Table 2 we present the overall scores for GrangerBusca (GB) and HawkesCumulants (HC) Achab2017 , focusing only on the Memetracker data. In this setting, HawkesCumulants has already been shown to be more accurate and faster than ADM4 Zhou2013 and HawkesGranger Xu2016 . Observe that GrangerBusca is more accurate than the competing method in every score. Indeed, Precision@n scores are at least three times greater depending on the cutoff (Precision@5, 10 and 20). Kendall scores also show a substantial gain (250%), with GrangerBusca achieving 0.20 and HC achieving 0.08 correlation on average. Relative errors for GrangerBusca are also 56% lower than for HC (0.44 versus 1.0). Finally, notice how GrangerBusca is slightly slower than HC when fewer nodes are present (100), but significantly surpasses HC in speed as the number of processes increases. This comes from the cubic cost incurred by HC.
To present an overall view of GrangerBusca against all three competing methods (ADM4, HC and HawkesGranger), in Figure 2 we show Precision@5, 10 and 20 scores for each approach on every Top-100 dataset. A total of 26 (out of 27 possible) points are plotted in the figure. One single setting, HC with Memetracker, is absent since the model was unable to train in sufficient time. The x-axis of the figure presents the Precision@n score for the baseline; similarly, the y-axis shows the Precision@n score for GrangerBusca. We also show three regions where either GrangerBusca or the competitors perform worse than a Null model, created by performing uniformly random rankings. Notice that for Precision@5 and Precision@10, GrangerBusca outperforms most baselines on a large fraction of the datasets. In fact, for Precision@5, there is only one setting where ADM4 outperforms GrangerBusca. As the precision cutoff increases, so does the efficacy of the Null model (i.e., it is easier for a random ranking to recover top edges). For Precision@20, there do exist some cases where GrangerBusca is outperformed by baseline methods. However, the majority of these cases fall in the region where both models are below the efficacy of a Null model.
Why does the model work? Recall that a Wold process is an adequate model when there is a strong dependency between consecutive inter-event times $\delta_i$ and $\delta_{i-1}$. To explain our results, we measured the correlation between $\delta_i$ and $\delta_{i-1}$ for pairs of interacting processes from the ground-truth data. Out of nine datasets, the worst case had a median Pearson correlation per pair equal to 0.3, a moderate value. However, in the remaining eight datasets this median was above 0.70, indicating the adequacy of a Wold model. The high linear dependency implies that the expected next inter-event time is approximately a linear function of the previous one, which is exactly the dependency our intensity captures (see Section 4 for a discussion on how to learn other forms).
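This diagnostic is easy to reproduce. The sketch below (synthetic series, not the ground-truth datasets) computes the Pearson correlation between each inter-event time and its predecessor; a self-feeding-like series scores high, while a homogeneous Poisson process, whose gaps are independent, scores near zero.

```python
import numpy as np

def consecutive_interevent_correlation(timestamps):
    """Pearson correlation between each inter-event time and its predecessor."""
    deltas = np.diff(np.sort(timestamps))
    return np.corrcoef(deltas[:-1], deltas[1:])[0, 1]

# Synthetic self-feeding-like series: each gap stays close to the previous one.
rng = np.random.default_rng(1)
gaps = [1.0]
for _ in range(5000):
    gaps.append(gaps[-1] * rng.lognormal(0.0, 0.1))
wold_like_ts = np.cumsum(gaps)

# A homogeneous Poisson process has independent gaps.
poisson_ts = np.cumsum(rng.exponential(1.0, size=5000))
```

Applying the function to each pair of interacting processes and taking the median per dataset gives exactly the statistic reported above.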
It is also important to understand the limitations of nonparametric methods such as HC. Recall that HC relies on the statistical moments (e.g., mean/variance) of the inter-event times Achab2017 . Given that web datasets exhibit long tails (with theoretical distributions exhibiting high, or even infinite, variance), such moments will not accurately capture the underlying behavior of the dataset (see Section 2).
6 Conclusions and Future Work
In this work, we present the first method to extract Granger causality matrices via Wold processes. Though it was proposed over sixty years ago, this framework of point processes remains largely unexplored by the machine learning community. Our approach, called GrangerBusca, outperforms recent baseline methods Achab2017 ; Xu2016 ; Zhou2013 both in training time and in overall accuracy. GrangerBusca works particularly well when extracting the top connections of a node (Precision@5, 10, 20).
Given the efficacy of GrangerBusca, our hope is that the current results may open up a new class of models, Wold processes, to be explored by the machine learning community. Finally, GrangerBusca can be used to explore real-world behavior in complex large-scale datasets.
Acknowledgements
We thank Fabricio Murai and the anonymous reviewers for providing comments. We also thank Gabriel Coutinho for discussions on the mathematical properties of GrangerBusca, as well as Alexandre Souza for providing pointers to prior studies. This work has been partially supported by the project ATMOSPHERE (atmosphere-eubrazil.eu), funded by the Brazilian Ministry of Science, Technology and Innovation (Project 51119 - MCTI/RNP 4th Coordinated Call) and by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement no 777154. Funding was also provided by the authors' individual grants from CNPq, CAPES and Fapemig. Computational resources were provided by the Microsoft Azure for Data Science Research Award (CRM:0740801).
References
[1] M. Achab, E. Bacry, S. Gaïffas, I. Mastromatteo, and J.-F. Muzy. Uncovering causality from multivariate Hawkes integrated cumulants. In ICML, 2017.
[2] R. Alves, R. Assunção, and P. O. S. Vaz de Melo. Burstiness scale: A parsimonious model for characterizing random series of events. In KDD, 2016.
[3] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 0.91 edition, 2015.
[4] E. Bacry, I. Mastromatteo, and J.-F. Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01), 2015.
[5] Y. Bao, Z. Kuang, P. Peissig, D. Page, and R. Willett. Hawkes process modeling of adverse drug reactions with longitudinal observational data. In ML for HC, 2017.
[6] A.-L. Barabási and M. Pósfai. Network Science. Cambridge University Press, Cambridge, 2016.
[7] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In NIPS, 2011.
[8] C. Blundell, J. Beck, and K. A. Heller. Modeling reciprocating relationships with Hawkes processes. In NIPS, 2012.
[9] S. Chen, A. Shojaie, E. Shea-Brown, and D. Witten. The multivariate Hawkes process in high dimensions: Beyond mutual excitation. arXiv preprint arXiv:1707.04928, 2017.
[10] J. Chevallier, M. J. Cáceres, M. Doumic, and P. Reynaud-Bouret. Microscopic approach of a time elapsed neural model. Mathematical Models and Methods in Applied Sciences, 25(14), 2015.
[11] E. Choi, N. Du, R. Chen, L. Song, and J. Sun. Constructing disease network and temporal progression model via context-sensitive Hawkes process. In ICDM, 2015.
[12] D. R. Cox. Some statistical methods connected with series of events. Journal of the Royal Statistical Society. Series B (Methodological), 1955.
[13] D. Daley. Stationary point processes by Markov-dependent intervals and infinite intensity. Journal of Applied Probability, 19(A), 1982.
[14] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes, vol. 1. Springer, New York, 2003.
[15] M. Dhamala, G. Rangarajan, and M. Ding. Estimating Granger causality from Fourier and wavelet transforms of time series data. Physical Review Letters, 100(1), 2008.
[16] V. Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 2008.
[17] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. Dirichlet-Hawkes processes with applications to clustering continuous-time document streams. In KDD, 2015.
[18] J. Etesami, N. Kiyavash, K. Zhang, and K. Singhal. Learning network of multivariate Hawkes processes: A time series approach. In UAI, 2016.
[19] P. M. Fenwick. A new data structure for cumulative frequency tables. Software: Practice and Experience, 24(3), 1994.
[20] C. W. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, 1969.
[21] P. Guttorp and T. L. Thorarinsdottir. What happened to discrete chaos, the Quenouille process, and the sharp Markov property? Some history of stochastic point processes. International Statistical Review, 80(2), 2012.
[22] D. Hallac, Y. Park, S. Boyd, and J. Leskovec. Network inference via the time-varying graphical lasso. In KDD, 2017.
[23] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[24] A. G. Hawkes. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society. Series B (Methodological), 1971.
[25] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1), 1971.
[26] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[27] V. Isham. A Markov construction for a multidimensional point process. Journal of Applied Probability, 14(3), 1977.
[28] S. Kumar, F. Spezzano, V. Subrahmanian, and C. Faloutsos. Edge weight prediction in weighted signed networks. In ICDM, 2016.
[29] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD, 2009.
[30] J. Leskovec, D. P. Huttenlocher, and J. M. Kleinberg. Governance in social media: A case study of the Wikipedia promotion process. In ICWSM, 2010.
[31] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 2007.
[32] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In KDD, 2014.
[33] S. Li, X. Gao, W. Bao, and G. Chen. FM-Hawkes: A Hawkes process based approach for modeling online activity correlations. In CIKM, 2017.
[34] S. Linderman and R. Adams. Discovering latent network structure in point process data. In ICML, 2014.
[35] S. Linderman and R. Adams. Scalable Bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228, 2015.
[36] R. P. Monti, P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana. Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage, 103, 2014.
[37] A. Namaki, A. Shirazi, R. Raei, and G. Jafari. Network analysis of a financial market based on genuine correlation and threshold method. Physica A: Statistical Mechanics and its Applications, 390(21), 2011.
[38] Y. Ogata. On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, 1981.
[39] P. Panzarasa, T. Opsahl, and K. M. Carley. Patterns and dynamics of users' behavior and interaction: Network analysis of an online community. Journal of the Association for Information Science and Technology, 60(5), 2009.
[40] A. Paranjape, A. R. Benson, and J. Leskovec. Motifs in temporal networks. In WSDM, 2017.
[41] M.-A. Rizoiu, Y. Lee, S. Mishra, and L. Xie. Hawkes processes for events in social media. In S.-F. Chang, editor, Frontiers of Multimedia Research. ACM, 2018.
[42] J. Silva and R. Willett. Hypergraph-based anomaly detection of high-dimensional co-occurrences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3), 2009.
[43] M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7), 2007.
[44] A. Terenin and E. P. Xing. Techniques for proving asynchronous convergence results for Markov chain Monte Carlo methods. In NIPS, 2017.
[45] P. O. S. Vaz de Melo, C. Faloutsos, R. Assunção, R. Alves, and A. A. Loureiro. Universal and distinct properties of communication dynamics: How to generate realistic inter-event times. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(3), 2015.
[46] P. O. S. Vaz de Melo, C. Faloutsos, R. Assunção, and A. Loureiro. The self-feeding process: A unifying model for communication dynamics in the web. In WWW, 2013.
[47] H. Wold. On stationary point processes and Markov chains. Scandinavian Actuarial Journal, 1948(1-2), 1948.
[48] H. Xu, M. Farajtabar, and H. Zha. Learning Granger causality for Hawkes processes. In ICML, 2016.
[49] Y. Yang, J. Etesami, N. He, and N. Kiyavash. Online learning for multivariate Hawkes processes. In NIPS, 2017.
[50] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich. Local higher-order graph clustering. In KDD, 2017.
[51] H.-F. Yu, C.-J. Hsieh, H. Yun, S. Vishwanathan, and I. S. Dhillon. A scalable asynchronous distributed algorithm for topic modeling. In WWW, 2015.
[52] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma. LightLDA: Big topic models on modest computer clusters. In WWW, 2015.
[53] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, 2013.
Appendix A Simulating GrangerBusca
In Algorithm 1 we show how Ogata's Modified Thinning algorithm [38] is adapted for GrangerBusca. We initially point out that some care has to be taken with the first simulated timestamps. Given that the previous observation does not exist for the first event, the algorithm needs to either start with a synthetic initial increment or fall back to the Poisson rate. In the algorithm, the rate of each individual process is computed. Then, a new observation is generated based on the sum of such rates. Given that each process behaves like a Poisson process until a new event surfaces (Figure 1), the sum of these processes is also a Poisson process. Lastly, we employ multinomial sampling to determine the process to which the new observation belongs.
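A minimal sketch of the simulation idea, assuming rates that stay constant until the next event occurs (the `rates_fn` below is a hypothetical excitation rule, not GrangerBusca's actual intensity): because the superposition of the processes is locally Poisson, we can draw the next event time from the summed rate and then attribute it to a process with a multinomial draw.

```python
import numpy as np

def simulate(rates_fn, t_end, rng):
    """Sample events from K processes whose rates stay constant until the
    next event, so their superposition is locally a Poisson process."""
    t, events = 0.0, []
    while True:
        rates = rates_fn(t, events)           # current rate of each process
        total = rates.sum()
        t = t + rng.exponential(1.0 / total)  # next event of the superposition
        if t >= t_end:
            return events
        k = rng.choice(len(rates), p=rates / total)  # multinomial ownership
        events.append((t, k))

# Hypothetical rule: process 0 is constant; process 1's rate grows (capped)
# with every event of process 0, mimicking excitation.
def rates_fn(t, events):
    n0 = sum(1 for _, k in events if k == 0)
    return np.array([1.0, min(0.5 * 2.0 ** n0, 8.0)])

rng = np.random.default_rng(2)
evts = simulate(rates_fn, 50.0, rng)
```

The full Algorithm 1 additionally handles the synthetic initial increment and the Poisson fallback discussed above.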
Appendix B Log Likelihood
We now derive the log likelihood of GrangerBusca for its parameters. For a point process with intensity $\lambda(t)$, observed on $[0, T]$ with events $t_1 < \cdots < t_n$, the likelihood can be computed as [14]:
$\mathcal{L} = \big[\prod_{i=1}^{n} \lambda(t_i)\big] \exp\big(-\int_0^{T} \lambda(t)\,dt\big)$   (7)
Considering the intensity of each process separately, we can write the log likelihood as:
$\log \mathcal{L} = \sum_b \Big[ \sum_{t_i \in \mathcal{P}_b} \log\lambda_b(t_i) - \int_0^{T}\lambda_b(t)\,dt \Big]$, with each integral expanded as a sum of rectangle areas over the intervals where $\lambda_b$ is constant.   (8)
Here, $t_n^b$ is the last event from process $b$. The expansion of the integral comes from the stepwise nature of $\lambda_b(t)$ (see Figure 1). For simplicity, let us initially assume that there is only one process. As discussed in the paper, each evaluation of the intensity carries a cost of its own; because the log likelihood contains summations of such evaluations, this per-evaluation cost is paid for every observation.
Now, let us return to the case of multiple processes. Let $n_b$ be the number of events for process $b$, and let $n_{\max} = \max_b n_b$ be the number of events in the process with the most events. The cost of evaluating the likelihood naively grows with $n_{\max}$ for every process, due to the nested summation over past events. To simplify the comparison with past methods, in our manuscript we did not detail our runtime cost in terms of $n_{\max}$. Strictly speaking, the cost of our fitting algorithm with the MCMC sampler also depends on $n_{\max}$.
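The per-process decomposition above can be checked numerically. The sketch below (our own illustration, not the paper's code) computes the log likelihood of a single process whose intensity is constant between events, so the integral term reduces to a sum of rectangle areas; the helper `loglik_piecewise` and its `rate_at` callback are hypothetical names.

```python
import numpy as np

def loglik_piecewise(timestamps, rate_at):
    """Log-likelihood of events on [0, t_n] for an intensity that is constant
    between consecutive events: the sum of log-rates at the events minus the
    integral of the rate, computed as a sum of rectangle areas."""
    ts = np.asarray(timestamps)
    edges = np.concatenate(([0.0], ts))
    # rate_at(ts, i) is the rate in force on the interval (edges[i], edges[i+1]]
    rates = np.array([rate_at(ts, i) for i in range(len(ts))])
    log_term = np.log(rates).sum()
    integral = (rates * np.diff(edges)).sum()
    return log_term - integral

# Sanity check: for a homogeneous Poisson process with rate mu observed up to
# its last event t_n, the formula reduces to n*log(mu) - mu*t_n.
ts = np.array([0.5, 1.2, 3.0, 4.4])
mu = 1.3
ll = loglik_piecewise(ts, lambda ts, i: mu)
```

Calling this once per process and summing the results mirrors the outer sum over processes in Equation (8).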
Appendix C Fitting Algorithm
The algorithm is shown in Algorithm 2, with the E-step detailed in Algorithm 3. The maximization step, for GrangerBusca in particular, is an MLE estimation for a Poisson process. The pseudo-code shown here is not parallel and builds the F+Tree naively. By updating the counts with a sloppy counter (see Chapter 11 of [3]) across processing cores, each core only needs to iterate over its local data to compute updates. The counter consists of a local count kept by each processor. After a fixed number of steps, say every few iterations, the local count is synced with a master parameter server.
The runtime of the algorithm may be optimized by either precomputing or caching the intensity evaluations for every observation from every process. Nevertheless, this precomputation comes at a memory cost that is likely prohibitive for larger datasets. We can, however, cache a small subset of such values to amortize the cost on cache hits. Secondly, the sampling cost can also be amortized with an Alias Table. With these two optimizations, it is possible to implement optimized versions of the sampling algorithm that execute at an amortized cost per iteration.
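The Alias Table trick mentioned above can be sketched as follows. This is a generic implementation of Walker's alias method (not GrangerBusca's internal sampler): after linear-time preprocessing of a fixed discrete distribution, each draw costs constant time.

```python
import random

def build_alias(weights):
    """Walker's alias method: O(K) preprocessing for O(1) discrete sampling."""
    K = len(weights)
    total = sum(weights)
    scaled = [w * K / total for w in weights]
    prob, alias = [0.0] * K, [0] * K
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:  # leftovers have probability (numerically) 1.0
        prob[i] = 1.0
    return prob, alias

def draw(prob, alias, rng):
    """One O(1) draw: pick a column uniformly, then coin-flip against its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

rng = random.Random(3)
prob, alias = build_alias([0.1, 0.2, 0.7])
counts = [0, 0, 0]
for _ in range(30000):
    counts[draw(prob, alias, rng)] += 1
```

Note that the table must be rebuilt when the weights change, which is why the amortization pays off only for distributions that stay fixed across many draws.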