Predicting Scientific Citations with Graph Neural Networks

Michał Jurzak

Citations are a strong, if noisy, proxy for a paper’s influence. This project forecasts the future citation count of academic papers from the structure of the citation graph and the semantic content of their abstracts, with the practical aim of surfacing emerging work before it becomes mainstream.

Data

The work uses the Cit-HepTh citation network from Stanford SNAP — high-energy physics theory papers from arXiv, January 1993 to April 2003.
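As a rough illustration of how citation counts fall out of such a network, the sketch below parses a SNAP-style edge list and counts incoming edges per paper. It assumes the usual SNAP text layout (comment lines starting with #, then one whitespace-separated "citing cited" pair per line); the paper IDs in SAMPLE are invented for the example, and the helper names parse_edge_list and citation_counts are hypothetical rather than taken from the project code.

```python
from collections import Counter

# Made-up sample in the assumed SNAP layout; real IDs will differ.
SAMPLE = """# Directed graph: an edge "a b" means paper a cites paper b
# FromNodeId\tToNodeId
1001\t2001
1002\t2001
1002\t1001
"""

def parse_edge_list(text):
    """Return a list of (citing, cited) integer pairs, skipping comments."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()
        edges.append((int(src), int(dst)))
    return edges

def citation_counts(edges):
    """In-degree of each paper, i.e. how many times it is cited."""
    return Counter(dst for _, dst in edges)

edges = parse_edge_list(SAMPLE)
counts = citation_counts(edges)
print(counts.most_common())
```

In the real task the count would be restricted to citations arriving within a given time window, which needs paper dates in addition to the edge list.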

Features

Each node is described by a mix of content and graph-structural features:

Method

The core model is a message-passing graph neural network. For node v at layer \ell, the representation is updated by aggregating over its neighbourhood \mathcal{N}(v),

h_v^{(\ell+1)} = \sigma\!\left( W^{(\ell)} h_v^{(\ell)} + \sum_{u \in \mathcal{N}(v)} \frac{1}{\sqrt{d_u d_v}}\, W^{(\ell)} h_u^{(\ell)} \right),

where d_v is the degree of v and \sigma a non-linearity. Six architectures were compared: GCN, GraphSAGE, GAT, a Hybrid SAGE + GAT, a Wide & Deep GAT with parallel graph and MLP paths, and an EvolvingGNN that threads a GRU through successive graph snapshots for temporal dynamics.
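The update rule above can be sketched in plain Python for a toy graph. This is a minimal illustration, not the project's implementation: it assumes an undirected adjacency dictionary, takes \sigma to be ReLU, uses a fixed weight matrix instead of a learned one, and the function name message_passing_layer is hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def message_passing_layer(h, neighbours, W):
    """One layer: h_v' = relu(W h_v + sum_u W h_u / sqrt(d_u * d_v))."""
    degree = {v: len(ns) for v, ns in neighbours.items()}
    updated = {}
    for v, ns in neighbours.items():
        total = matvec(W, h[v])  # self term W h_v
        for u in ns:
            norm = 1.0 / math.sqrt(degree[u] * degree[v])
            message = matvec(W, h[u])
            total = [t + norm * m for t, m in zip(total, message)]
        updated[v] = [max(0.0, t) for t in total]  # ReLU assumed for sigma
    return updated

# Toy undirected path graph 0 - 1 - 2 with 2-dimensional features.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only the aggregation is visible

h_next = message_passing_layer(h, neighbours, W)
print(h_next)
```

With the identity matrix the output shows the normalisation directly: node 0 receives its single neighbour's features scaled by 1/\sqrt{2}, because d_0 = 1 and d_1 = 2.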

Citation counts are heavily right-skewed, so the models regress the transformed count \tilde{y} = \log(1 + y) rather than the raw count y. Training uses temporal splits (train 1999, validate 2000, test 2001) with multi-year sliding windows, plus dropout, layer normalisation, gradient clipping, and early stopping.
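The target transform and the year-based split can be sketched as follows. This is an illustration under assumptions: each example is taken to be a dictionary with hypothetical "year" and "citations" fields, the records are invented, and the sliding-window construction is left out.

```python
import math

def to_log_target(y):
    """Forward transform: y_tilde = log(1 + y)."""
    return math.log1p(y)

def from_log_target(y_tilde):
    """Inverse transform back to a citation count."""
    return math.expm1(y_tilde)

def temporal_split(examples, train_year=1999, val_year=2000, test_year=2001):
    """Assign each example to a split by its year; other years are dropped."""
    by_year = {train_year: "train", val_year: "val", test_year: "test"}
    splits = {"train": [], "val": [], "test": []}
    for ex in examples:
        name = by_year.get(ex["year"])
        if name is not None:
            splits[name].append(ex)
    return splits

# Invented records, for illustration only.
examples = [
    {"id": "a", "year": 1998, "citations": 5},
    {"id": "b", "year": 1999, "citations": 0},
    {"id": "c", "year": 2000, "citations": 12},
    {"id": "d", "year": 2001, "citations": 39},
]

splits = temporal_split(examples)
targets = {ex["id"]: to_log_target(ex["citations"]) for ex in examples}
print({name: [ex["id"] for ex in part] for name, part in splits.items()})
```

Splitting by year rather than at random keeps later information out of training, and predictions made in log space are mapped back to counts with from_log_target.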

Results

On the sliding-window regression task, GraphSAGE was the strongest single model:

A persistent “integer gap” remained: no model reliably separated papers with 12 citations from those with 39, and the difference between them behaves largely as stochastic noise.

Source

Code, notebooks, and the full set of figures are in the source repository.