Run Infomap on HPC

Infomap trials are independent work units. That makes them a good fit for HPC: run many trials on one node with --parallel-trials, or split trials across a scheduler job array with --trial-offset and merge the shard results afterwards.

This notebook uses Python as orchestration glue. The expensive work runs in the native Infomap command-line binary. The final merge uses the Python helper python -m infomap.merge, which reads shard JSON files and copies the winning tree without rerunning Infomap.

Build for your cluster

Start by checking which compiler, OpenMP runtime, and flags Infomap will use:

make doctor

A portable OpenMP build is the usual starting point:

make build-native OPENMP=1

For a homogeneous partition where build and run nodes have the same CPU type, you can opt into node-local tuning:

make build-native OPENMP=1 NATIVE_ARCH=1

NATIVE_ARCH=1 enables non-portable native tuning such as -march=native, link-time optimization, and loop unrolling. Build it on the same kind of node where you will run it: a binary tuned for one CPU can fail with an illegal-instruction error on an older one.

If OpenMP is hard to configure on your cluster, build without it and scale with job arrays instead:

make build-native OPENMP=0

The setup cell locates the source checkout, the built binary, and the example network:

from __future__ import annotations

import json
import math
import os
import subprocess
import sys
import tempfile
from pathlib import Path

for candidate in (Path.cwd(), *Path.cwd().parents):
    if (candidate / "src" / "main.cpp").exists():
        repo = candidate
        break
else:
    raise RuntimeError("Run this notebook from an Infomap source checkout.")

infomap = repo / ("Infomap.exe" if os.name == "nt" else "Infomap")
assert infomap.exists(), "Run make build-native first."

network = repo / "examples" / "networks" / "ninetriangles.net"
_work_dir = tempfile.TemporaryDirectory(prefix="infomap-hpc-")
work = Path(_work_dir.name)

print("Infomap binary: ./Infomap")
print("Network: examples/networks/ninetriangles.net")
print("Work directory: temporary notebook directory")

Single-node recipe

Use --parallel-trials when one allocated node has enough cores and memory for the whole run. --num-threads auto picks up scheduler limits such as SLURM_CPUS_PER_TASK, so the process does not silently use every core on a shared node.

single_dir = work / "single-node"
single_dir.mkdir(exist_ok=True)

subprocess.run(
    [
        str(infomap),
        str(network),
        str(single_dir),
        "--num-trials", "6",
        "--parallel-trials",
        "--num-threads", "auto",
        "--seed", "123",
        "--timing-json", str(single_dir / "timing.json"),
        "--summary-json", str(single_dir / "summary.json"),
        "--manifest-json", str(single_dir / "manifest.json"),
        "--silent",
    ],
    env={**os.environ, "SLURM_CPUS_PER_TASK": "2"},
    check=True,
)

timing = json.loads((single_dir / "timing.json").read_text())
summary = json.loads((single_dir / "summary.json").read_text())
manifest = json.loads((single_dir / "manifest.json").read_text())

print("thread source:", timing["thread_source"])
print("threads used:", timing["threads_used"])
print("trial codelengths:", summary["trial_codelengths"])
print("config fingerprint:", manifest["config_fingerprint"])

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(range(len(summary["trial_codelengths"])), summary["trial_codelengths"], marker="o")
ax.set_xlabel("Trial")
ax.set_ylabel("Codelength")
ax.set_title("Independent trial results")
ax.grid(True, alpha=0.3)
plt.show()
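
The idea behind --num-threads auto can be pictured with a small stand-alone sketch. This is not Infomap's implementation: the function name resolve_thread_count and the source labels it returns are hypothetical, and the fallback to os.cpu_count() is an assumption about what a sensible default would be when no scheduler limit is set.

```python
import os


def resolve_thread_count(requested="auto", env=None):
    """Return (threads, source) for a requested thread setting.

    Illustrative only: it mirrors the idea of honoring scheduler limits,
    not the logic inside the Infomap binary.
    """
    env = os.environ if env is None else env

    # An explicit number always wins over any environment hint.
    if requested != "auto":
        return int(requested), "explicit"

    # Prefer the scheduler's per-task CPU allocation when it is set.
    value = env.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value), "SLURM_CPUS_PER_TASK"

    # Assumed fallback: every core the operating system reports.
    return os.cpu_count() or 1, "os.cpu_count"


print(resolve_thread_count("auto", {"SLURM_CPUS_PER_TASK": "2"}))
# (2, 'SLURM_CPUS_PER_TASK')
```

Passing the environment as a plain dictionary keeps the sketch easy to test without touching the real process environment, in the same spirit as the env= override used in the subprocess call above.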

Distributed trials on one machine

The job-array version uses the same commands. This notebook simulates two array tasks locally. Each shard runs a disjoint range of global trial indices and writes one --trial-results JSON file.

The important rule is that the reference run also uses sharding mode. A plain serial --num-trials N run uses legacy continuous RNG behavior and is not the right comparison for deterministic sharding.

ref_dir = work / "reference"
shard0_dir = work / "shard-0"
shard1_dir = work / "shard-1"

for out_dir, trials, offset, result_name in [
    (ref_dir, 5, 0, "reference.json"),
    (shard0_dir, 2, 0, "results_0.json"),
    (shard1_dir, 3, 2, "results_1.json"),
]:
    out_dir.mkdir(exist_ok=True)
    subprocess.run(
        [
            str(infomap),
            str(network),
            str(out_dir),
            "--num-trials", str(trials),
            "--trial-offset", str(offset),
            "--seed", "123",
            "--num-threads", "1",
            "--trial-results", str(out_dir / result_name),
            "--silent",
        ],
        check=True,
    )

reference_data = json.loads((ref_dir / "reference.json").read_text())
shard0_data = json.loads((shard0_dir / "results_0.json").read_text())
shard1_data = json.loads((shard1_dir / "results_1.json").read_text())

reference_vector = [
    float(trial["codelength"])
    for trial in sorted(reference_data["trials"], key=lambda trial: int(trial["trial"]))
]
shard_vector = [
    float(trial["codelength"])
    for shard in (shard0_data, shard1_data)
    for trial in sorted(shard["trials"], key=lambda trial: int(trial["trial"]))
]

assert reference_vector == shard_vector
print("reference vector:", reference_vector)
print("shard vector:    ", shard_vector)
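
The (trials, offset) pairs in the loop above were written by hand. For larger runs it may help to derive them from a total trial count. The helper below is a minimal sketch with a hypothetical name, shard_plan; it is not part of Infomap. It splits the global trial indices into contiguous, non-overlapping ranges and gives any remainder to the first shards.

```python
def shard_plan(total_trials, num_shards):
    """Split global trials 0..total_trials-1 into contiguous (offset, count) pairs."""
    if num_shards < 1 or total_trials < num_shards:
        raise ValueError("need at least one trial per shard")

    base, extra = divmod(total_trials, num_shards)
    plan = []
    offset = 0
    for shard in range(num_shards):
        # The first `extra` shards each take one additional trial.
        count = base + (1 if shard < extra else 0)
        plan.append((offset, count))
        offset += count
    return plan


print(shard_plan(5, 2))
# [(0, 3), (3, 2)]
print(shard_plan(100, 4))
# [(0, 25), (25, 25), (50, 25), (75, 25)]
```

Each pair maps to --trial-offset and --num-trials for one shard. The split for five trials differs from the 2 + 3 split used above, but both cover global trials 0 to 4 exactly once, which is the property that matters.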

Merge shard results with Python

Merging is intentionally a Python post-processing step. It is cheap compared with the Infomap runs: it reads shard JSON, verifies matching fingerprints, selects the lowest-codelength global trial, and writes tree / clu output from the winning tree.

merge_dir = work / "merge"
merge_dir.mkdir(exist_ok=True)

subprocess.run(
    [
        sys.executable,
        "-m", "infomap.merge",
        str(shard0_dir / "results_0.json"),
        str(shard1_dir / "results_1.json"),
        "--out-name", str(merge_dir / "final"),
        "--output", "tree,clu",
        "--require-complete-trials",
    ],
    check=True,
    stdout=subprocess.DEVNULL,
)

assert (merge_dir / "final.tree").exists()
assert (merge_dir / "final.clu").exists()

trials = shard0_data["trials"] + shard1_data["trials"]
winner = min(trials, key=lambda trial: (float(trial["codelength"]), int(trial["trial"])))
assert math.isclose(float(winner["codelength"]), min(reference_vector))

print("merged outputs: merge/final.tree, merge/final.clu")
print("winning global trial:", winner["trial"])
print("winning codelength:", winner["codelength"])

Programmatic merge uses the same implementation:

from infomap.merge import merge_trial_results

summary = merge_trial_results(
    ["results_*.json"],
    out_name="final",
    formats=("tree", "clu"),
    require_complete=True,
)
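
To make the merge rules concrete, here is a minimal stand-alone sketch of the checks and the selection step. It is not the infomap.merge implementation. The function name pick_winner is hypothetical, the "trials" layout follows the shard JSON read earlier in this notebook, and two things are assumptions: that a shard file carries a "config_fingerprint" key, and that a complete run means global trial indices 0 to N-1 with no gaps.

```python
def pick_winner(shards, require_complete=False):
    """Select the best trial across shard result dictionaries.

    Each shard looks like {"trials": [{"trial": 0, "codelength": 3.5}, ...]}.
    The optional "config_fingerprint" key is an assumption of this sketch.
    """
    # All shards should come from the same input, seed, and options.
    fingerprints = {shard.get("config_fingerprint") for shard in shards}
    if len(fingerprints) > 1:
        raise ValueError("shards were produced with different configurations")

    trials = [trial for shard in shards for trial in shard["trials"]]
    if not trials:
        raise ValueError("no trials to merge")

    # Overlapping offsets would count the same global trial twice.
    indices = sorted(int(trial["trial"]) for trial in trials)
    if len(set(indices)) != len(indices):
        raise ValueError("overlapping global trial indices")

    # Assumed meaning of "complete": indices 0..N-1 without gaps.
    if require_complete and indices != list(range(len(indices))):
        raise ValueError("missing global trial indices")

    # Lowest codelength wins; ties go to the lowest global trial index.
    return min(trials, key=lambda t: (float(t["codelength"]), int(t["trial"])))


shard_a = {"config_fingerprint": "abc",
           "trials": [{"trial": 0, "codelength": 3.5},
                      {"trial": 1, "codelength": 3.25}]}
shard_b = {"config_fingerprint": "abc",
           "trials": [{"trial": 2, "codelength": 3.25},
                      {"trial": 3, "codelength": 3.75}]}

print(pick_winner([shard_a, shard_b], require_complete=True))
# {'trial': 1, 'codelength': 3.25}
```

The tie-break on the trial index matches the check used earlier in this notebook and keeps the selection independent of the order in which shard files are listed.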

SLURM recipe

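The array job runs the native binary. --num-trials is the per-shard count, and --trial-offset maps each array task to a global trial range. In the script below, four tasks of 25 trials each cover global trials 0 to 99; task 2, for example, starts at offset 50. Create the logs directory before submitting, because Slurm does not create missing directories for --output and --error.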

#!/usr/bin/env bash
#SBATCH --job-name=infomap-shards
#SBATCH --array=0-3
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
#SBATCH --output=logs/infomap_%A_%a.out
#SBATCH --error=logs/infomap_%A_%a.err

set -euo pipefail

INFOMAP=/path/to/Infomap
NETWORK=/path/to/graph.net
OUT=/path/to/out
TRIALS_PER_SHARD=25
OFFSET=$((SLURM_ARRAY_TASK_ID * TRIALS_PER_SHARD))

mkdir -p "$OUT/shards/$SLURM_ARRAY_TASK_ID"

echo "Shard: $SLURM_ARRAY_TASK_ID"
echo "Trial offset: $OFFSET"
echo "Trials: $TRIALS_PER_SHARD"
echo "CPUs: $SLURM_CPUS_PER_TASK"
date

srun "$INFOMAP" "$NETWORK" "$OUT/shards/$SLURM_ARRAY_TASK_ID" \
  --num-trials "$TRIALS_PER_SHARD" \
  --trial-offset "$OFFSET" \
  --seed 123 \
  --num-threads auto \
  --trial-results "$OUT/results_${SLURM_ARRAY_TASK_ID}.json" \
  --no-final-output

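Substitute the job ID printed by sbatch and the array task ID you want to inspect. squeue shows the queue state, tail -f follows one shard’s stdout live, and sacct reports the final state, exit code, and memory use of each task. Completeness across shards is checked later by the merge: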

squeue -j <job-id>
tail -f logs/infomap_<job-id>_<task-id>.out
sacct -j <job-id> --format=JobID,State,ExitCode,Elapsed,MaxRSS

After the array completes, run the merge as a small Python post-processing job:

python -m infomap.merge '/path/to/out/results_*.json' \
  --out-name /path/to/out/final \
  --output tree,clu \
  --require-complete-trials

Before scaling

  • Use the same input, seed, and algorithm options for all shards.

  • Treat --num-trials as the per-shard count.

  • Use non-overlapping --trial-offset ranges.

  • Keep the reference comparison in sharding mode.

  • Use --require-complete-trials when gaps should fail the merge.

  • Check exit codes in batch scripts. Input and output failures should not look like successful runs.

  • Remember that merge can write tree and clu; link-bearing formats need a full Infomap run.
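
The exit-code item in the checklist above can be enforced from Python orchestration code as well as from batch scripts. The sketch below uses a hypothetical helper, run_checked, that treats a nonzero exit code or a missing or empty output file as a failure. A tiny Python child process stands in for an Infomap shard so the example runs anywhere; the file name results_0.json is only illustrative.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_checked(command, expected_output):
    """Run a command and fail loudly if it errors or writes no output."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        raise RuntimeError(
            f"command failed with exit code {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )

    # A zero exit code alone is not enough: the result file must exist
    # and contain something.
    path = Path(expected_output)
    if not path.is_file() or path.stat().st_size == 0:
        raise RuntimeError(f"expected output missing or empty: {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results_0.json"
    # Stand-in for a shard run: a child process that writes one small file.
    writer = [
        sys.executable,
        "-c",
        "import sys; open(sys.argv[1], 'w').write('{}')",
        str(out),
    ]
    print(run_checked(writer, out).name)
    # results_0.json
```

In the notebook cells above, subprocess.run(..., check=True) already covers the exit-code half of this; the extra file check is aimed at runs that exit cleanly but leave no usable shard result behind.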