Predicting Scientific Citations with Graph Neural Networks

Michał Jurzak

Citations are a strong, if noisy, proxy for a paper’s influence. This project forecasts the future citation count of academic papers from the structure of the citation graph and the semantic content of their abstracts, with the practical aim of surfacing emerging work before it becomes mainstream.

Data

The work uses the Cit-HepTh citation network from Stanford SNAP — high-energy physics theory papers from arXiv, January 1993 to April 2003.

27,770 papers (nodes) and 352,807 citations (edges).
Per-paper abstracts for semantic features and timestamps for temporal modelling.

Features

Each node is described by a mix of content and graph-structural features:

Abstract embeddings — 384-dimensional vectors from the all-MiniLM-L6-v2 sentence transformer.
In-degree / out-degree — citations received and references made.
PageRank — a global importance score over the citation graph.
Velocity — citations accrued in the previous year.
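The structural features above can be sketched in plain Python. This is an illustrative sketch rather than the project's code: the function names, the edge convention (u, v) meaning "u cites v", the damping factor of 0.85, and the use of the citing paper's publication year as the citation date are all assumptions.

```python
def degree_features(edges, num_nodes):
    """In-degree (citations received) and out-degree (references made)."""
    in_deg = [0] * num_nodes
    out_deg = [0] * num_nodes
    for u, v in edges:  # assumed convention: paper u cites paper v
        out_deg[u] += 1
        in_deg[v] += 1
    return in_deg, out_deg


def pagerank(edges, num_nodes, damping=0.85, iters=100):
    """Power-iteration PageRank; dangling nodes spread their mass uniformly."""
    out_links = [[] for _ in range(num_nodes)]
    for u, v in edges:
        out_links[u].append(v)
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iters):
        dangling = sum(r for r, links in zip(rank, out_links) if not links)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        for u, links in enumerate(out_links):
            if links:
                share = damping * rank[u] / len(links)
                for v in links:
                    new_rank[v] += share
        rank = new_rank
    return rank


def velocity(edges, year_of, current_year, num_nodes):
    """Citations received from papers published in the year before current_year."""
    counts = [0] * num_nodes
    for u, v in edges:
        if year_of[u] == current_year - 1:  # assumption: citation date = citing paper's year
            counts[v] += 1
    return counts


# Toy graph: papers 1, 2, 3 cite paper 0; paper 2 also cites paper 1.
toy_edges = [(1, 0), (2, 0), (2, 1), (3, 0)]
toy_years = [1995, 1996, 1997, 1997]
in_deg, out_deg = degree_features(toy_edges, 4)
ranks = pagerank(toy_edges, 4)
recent = velocity(toy_edges, toy_years, 1998, 4)
```

On a graph the size of Cit-HepTh a sparse-matrix or library implementation would be the more practical choice; the loops here are only meant to make the definitions concrete.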

Method

The core model is a message-passing graph neural network. For node v at layer \ell, the representation is updated by aggregating over its neighbourhood \mathcal{N}(v),

h_v^{(\ell+1)} = \sigma\!\left( W^{(\ell)} h_v^{(\ell)} + \sum_{u \in \mathcal{N}(v)} \frac{1}{\sqrt{d_u d_v}}\, W^{(\ell)} h_u^{(\ell)} \right),

where d_v is the degree of v, W^{(\ell)} is the learned weight matrix of layer \ell, and \sigma is a non-linearity. Six architectures were compared: GCN, GraphSAGE, GAT, a Hybrid SAGE + GAT, a Wide & Deep GAT with parallel graph and MLP paths, and an EvolvingGNN that threads a GRU through successive graph snapshots for temporal dynamics.
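To make the update rule concrete, here is a minimal sketch of a single layer applied to a three-node path graph. It assumes ReLU for \sigma, symmetric (undirected) neighbour lists so that every neighbour has non-zero degree, and a weight matrix stored as a list of rows; the function names are illustrative and not taken from the project.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def message_passing_layer(h, neighbours, W):
    """One layer of h_v' = relu(W h_v + sum_u W h_u / sqrt(d_u d_v)).

    h          : list of feature vectors, one per node
    neighbours : neighbours[v] lists the nodes adjacent to v (assumed symmetric)
    W          : weight matrix shared by the self term and the messages
    """
    degree = [len(n) for n in neighbours]
    out = []
    for v, h_v in enumerate(h):
        agg = matvec(W, h_v)  # self term W h_v
        for u in neighbours[v]:
            norm = 1.0 / math.sqrt(degree[u] * degree[v])
            msg = matvec(W, h[u])
            agg = [a + norm * m for a, m in zip(agg, msg)]
        out.append([max(0.0, a) for a in agg])  # ReLU stands in for sigma
    return out


# Path graph 0 - 1 - 2 with scalar features and an identity weight.
h0 = [[1.0], [2.0], [3.0]]
adj = [[1], [0, 2], [1]]
h1 = message_passing_layer(h0, adj, [[1.0]])
# Node 0 receives 1 + 2 / sqrt(1 * 2); node 1 receives 2 + (1 + 3) / sqrt(2).
```

The symmetric normalisation 1 / sqrt(d_u d_v) keeps high-degree nodes from dominating the sum, which matters in a citation graph where a few papers collect most of the edges.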

Citation counts are heavily skewed, so each model regresses the transformed count \tilde{y} = \log(1 + y) rather than the raw count. Training uses temporal splits (train 1999, validate 2000, test 2001) with multi-year sliding windows, and is stabilised and regularised with dropout, layer normalisation, gradient clipping, and early stopping.
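One plausible way to build a temporal example is sketched below. It is an assumption about how a split year is used, not a description of the project's pipeline: for a target year, the model sees only citations made before that year, and the label is the log-transformed number of citations a paper earns during it. The function names and the toy data are hypothetical.

```python
import math


def snapshot_and_labels(edges, year_of, target_year):
    """Return (visible_edges, labels) for one target year.

    edges   : (u, v) pairs meaning paper u cites paper v
    year_of : dict mapping paper id to publication year
    labels  : log(1 + citations received during target_year), for papers
              that already exist before target_year
    """
    visible = [(u, v) for u, v in edges if year_of[u] < target_year]
    counts = {p: 0 for p, y in year_of.items() if y < target_year}
    for u, v in edges:
        if year_of[u] == target_year and v in counts:
            counts[v] += 1
    labels = {p: math.log1p(c) for p, c in counts.items()}
    return visible, labels


def to_count(log_prediction):
    """Invert the log(1 + y) transform, clamping at zero."""
    return max(0.0, math.expm1(log_prediction))


toy_years = {0: 1998, 1: 1998, 2: 1999, 3: 1999, 4: 2000}
toy_edges = [(1, 0), (2, 0), (3, 0), (3, 1), (4, 2)]
split_years = {"train": 1999, "validate": 2000, "test": 2001}
train_graph, train_labels = snapshot_and_labels(toy_edges, toy_years, split_years["train"])
```

Filtering edges by the citing paper's year is what prevents leakage: no citation made in or after the target year can reach the model as an input feature.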

Results

On the sliding-window regression task, GraphSAGE was the strongest single model:

SAGE: MAE 1.64, RMSE 5.67, R^2 = 0.37, Spearman 0.63 — versus a mean baseline at R^2 \approx 0.
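For reference, the four reported metrics can be computed as in the following sketch. The source does not say whether they are evaluated on raw or log-transformed counts, so the functions below are scale-agnostic; the helper names and the toy numbers are illustrative only.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination; predicting the mean everywhere scores 0."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(y_true, y_pred):
    """Pearson correlation of the ranks; NaN if either side is constant."""
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


y = [0, 1, 2, 10]
pred = [0.5, 1.0, 1.5, 6.0]
mean_baseline = [sum(y) / len(y)] * len(y)
scores = {"mae": mae(y, pred), "rmse": rmse(y, pred),
          "r2": r2(y, pred), "spearman": spearman(y, pred)}
baseline_r2 = r2(y, mean_baseline)
```

Spearman depends only on the ordering of predictions, which is why it can be fairly high even when R^2 is modest: a model may rank papers well while still misjudging the magnitude of the most-cited ones.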
Recasting the problem as classification into five impact bands proved more useful in practice. GraphSAGE recovered 58% of high-impact papers (40+ citations), while GCN was the best conservative filter, catching 68% of zero-citation papers.

A persistent “integer gap” remained: no model reliably separated papers with 1–2 citations from those with 3–9. At these low counts the difference between the two bands behaves largely as stochastic noise.
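The band-level figures are per-band recall: of the papers that truly fall in a band, the fraction the model places there. The sketch below shows that calculation. The source names four of the five bands (0, 1–2, 3–9, and 40+); the 10–39 band is inferred to fill the gap and is an assumption, as are the function names. The sketch bins predicted counts, whereas a classifier would output the band directly.

```python
import bisect

# Lower bounds of bands 1..4. The 10-39 band is an inferred assumption.
BAND_EDGES = [1, 3, 10, 40]
BAND_NAMES = ["0", "1-2", "3-9", "10-39", "40+"]


def to_band(count):
    """Map a citation count to a band index from 0 to 4."""
    return bisect.bisect_right(BAND_EDGES, count)


def per_band_recall(true_counts, pred_counts):
    """Recall per band; None where no paper truly falls in the band."""
    hits = [0] * len(BAND_NAMES)
    totals = [0] * len(BAND_NAMES)
    for t, p in zip(true_counts, pred_counts):
        band = to_band(t)
        totals[band] += 1
        if to_band(p) == band:
            hits[band] += 1
    return {
        name: (hits[i] / totals[i] if totals[i] else None)
        for i, name in enumerate(BAND_NAMES)
    }


true_counts = [0, 0, 1, 5, 50, 45]
pred_counts = [0, 2, 2, 1, 60, 12]
recall = per_band_recall(true_counts, pred_counts)
```

In this toy example the paper with 5 citations is predicted at 1, so it lands in the 1–2 band instead of 3–9. That is the kind of adjacent-band confusion the integer gap describes.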

Source

Code, notebooks, and the full set of figures are in the source repository.