It is kept for reference and is no longer updated.
For the newest version, check Infomap Online or Github.
Here we provide the source code to the Infomap algorithm for detecting communities in large networks with the map equation framework. See also GossipMap, a distributed version for billion-edge directed graphs implemented with GraphLab PowerGraph by Seung-Hee Bae and Bill Howe at University of Washington. We also provide significance clustering code.
Infomap is a command line software
written in C++ that runs in Linux, Mac OS X, and Windows with
gcc
installed.
Infomap requires a working
gcc
or clang
compiler.
In Ubuntu, for example, the metapackage
build-essential
installs the compiler as well as
related packages. Install it from the terminal with
sudo apt-get install build-essential
Since Mac OS X 10.9, the standard compiler tools are based on clang, which can be installed with
xcode-select --install
However, the current version lacks OpenMP support for
parallelization. While the Makefile automatically skips the
-fopenmap
compiler flag if the standard compiler is
clang, to get support for OpenMP you can manually install a
gcc-based compiler. A simple way is to install
Homebrew and type, for
example, brew install gcc43
in the terminal.
We recommend running Infomap in
bash on ubuntu on Windows 10. Follow
this installation guide
or
this installation guide
to enable the Linux Bash Shell in Windows 10. For example, install
git
with sudo apt-get install git
. Then
install the metapackage build-essential
for the
compiler and related packages with
sudo apt-get install build-essential
Without Windows 10, you can install MinGW/MSYS for a minimalist development environment. Follow the instructions on MinGW - Getting Started or download the complete MinGW-MSYS Bundle.
Infomap v0.x code can be downloaded here infomap-0.x.zip .
To extract the zipped archive and compile the source code in a Unix-like environment, open a terminal and type
cd [path/to/Infomap]
unzip infomap-0.x.zip
cd infomap-0.x
make
Substitute [path/to/Infomap]
to the folder where
Infomap was downloaded, such as ~/Downloads. With
git
installed, instead clone the source code with
git clone -b v0.x https://github.com/mapequation/infomap.git
.
If you installed a custom compiler with, for example,
brew install gcc43
, replace the last command with
CXX=g++-4.3 make
.
Compilation error due to missing parallelization library.
If there is a compilation error that references
-lomp
or -lpthread
, the compiler probably
can't find the OpenMP library which is used for
parallelization. To compile without the OpenMP library, type
make noomp
. You can also try to compile with another
compiler by calling for example CXX=g++-4.3 make
.
Please give us feedback if you have any troubles!
Usage: ./Infomap [options] network_data dest
The optional arguments can be put anywhere, see the
Options section for the available
options
. The network_data
should point to
a valid network file (see Input section) and
dest
to a directory where the Infomap should write its
output files.
option
is provided,
Infomap will assume an undirected
network and try to partition it hierarchically.
Run Infomap with the options -h
or
--help
to list the options below:
-d2N3
is equivalent to
-d -2 -N 3
.
Run Infomap with the options -hh
to list the options
below:
Argument types are defined as:
[+] means that the option can be repeated
Makefile
and
src
directory.
mkdir output
./Infomap ninetriangles.net output/ -N 10 --tree --bftree
Here, we first create a new directory where we want the put the
results, and feed that to
Infomap together with an input network
and an option. Now Infomap will try to
parse the file ninetriangles.net
as an
undirected network, partition it
hierarchically and write the best result out of
10
attempts to the output
directory, both
in the human readable .tree format and
the binary .bftree format, which can be
loaded into the
Network Navigator.
./Infomap ninetriangles.net output/ -N 10 --directed --two-level --clu --map
Here, Infomap will treat the network as directed and try to find the optimal two-level partition with respect to the flow. This will also create an output .clu file with the cluster indices, and a .map file with the modular network that can be visualized using the Map Generator.
Multiplex networks describes a set of networks with corresponding nodes but different link structures, and possibly also an inter-network link structure (see multiplex format). If no inter-link data is provided, or if the relax rate is explicitly set, the inter-link structure will be generated by relaxing the layer constraints of the intra-network links.
There are two ways to provide a multiplex network to Infomap:
./Infomap multiplex-network.net output/ -i multiplex
or
./Infomap network1.net network2.net [... networkN.net] output/
where network1.net
to networkN.net
are
ordinary networks in the Pajek or
link list format.
To see a minimal example on how Infomap can be run as a library,
check the files example.cpp
and
Makefile
in the
examples/cpp/minimal
directory, and run by typing:
cd examples/cpp/minimal
make
./example
The included directory
example/cpp/Infomap-igraph-interface-library/
contains
an example on how to write a small interface library to simplify
integration with other network libraries, here
igraph. Example code to use that is
included in the examples/cpp/igraph
directory.
brew install swig
.
To run Infomap from the included python example code, type:
cd examples/python
make
python example-networkx.py
The make
command first calls
make python
in the Infomap base directory, which uses
swig and
setuptools
to compile Infomap to a python module. Then that module is copied to
the current example directory to be easily imported in the included
python code.
The example-networkx.py
code depends on the
NetworkX module, which can
be installed by typing pip install networkx
using the
pip
tool.
To run Infomap from the included Jupyter notebook, first
compile Infomap to a python module as described above, then install Jupyter by typing
pip install jupyter
using the
pip
tool, and finally start the notebook server and open the
example notebook
in examples/python
by typing:
jupyter notebook infomap-examples.ipynb
Assuming R is installed, you can run Infomap from from the included R example code, type:
cd examples/R
make
R
source("example-minimal.R")
The make
command first calls make R
in the
Infomap base directory, which uses
swig to generate wrapper code for R and
R CMD SHLIB
to compile Infomap as a shared library. The
library can then be loaded in an R
environment by
calling source("load-infomap.R")
, which is
called by the example files.
The example-igraph.R
code gives a more complete
example, depending on the
igraph R package to generate a
network and to plot the communities found by Infomap.
Infomap decomposes a network into modules by optimally compressing a description of information flows on the network. To model this flow of information, we release a random walker on the network and encodes its steps and module exits. To tune the modular solution, you can therefore tune the dynamics of the random walker, or the encoding scheme.
Some relevant options here are --directed
,
--undirdir
, --rawdir
,
--teleportation-probability
,
--to-nodes
and --include-self-links
.
Relevant options are --recorded-teleportation
and
--markov-time
. With
--recorded-teleportation
, the solution captures the
blurring effect of the random teleportation, typically needed on
directed networks to reach a stationary solution. To minimize this
artificial effect, we therefore no longer use the option by default.
The --markov-time
defines how many steps the random
walker takes before its position is encoded (default 1). Shorter
Markov times than 1 mean that the average transition rate of a
random walker is lower than the encoding rate of its position, such
that the same node will be encoded multiple times in a row. As a
result, the map equation will favor more and smaller modules.
Oppositely, longer Markov times mean that the average transition
rate is higher than the encoding rate, such that not every node on
the trajectory will be encoded, and the map equation will favor
fewer and larger modules. Read more in
Efficient community detection of network flows for varying Markov
times and bipartite networks.
Currently, Infomap understands three
different file formats for the input network data, a minimal
link list format, the standard
Pajek format, and a 3-gram format for
second-order dynamics. Specify the input format to
Infomap with the -i
or
--input-format
flag together with pajek
or
link-list
as its argument. If no input format is
specified, Infomap will try to parse
the file by this assumption:
.net
- Pajek format
.txt
-
Link list format
0
, you
have to add the flag -z
or --zero-based
to
Infomap, otherwise it will assume that
the node numbering starts from 1
.
A link list is a minimal format to describe a network by only specifying a set of links:
# A network in link list format
1 2 1
1 3 1
2 3 2
3 5 0.5
...
Each line corresponds to the triad
source target weight
which describes a weighted link
between the nodes with specified numbers. The
weight
can be any non-negative value. If omitted, the
link weight will default to 1
. The first node are
assumed to start from 1
* and the total
number of nodes will be determined by the maximum node number.
The Pajek format specifies both the nodes and the links in two different sections of the file:
# A network in Pajek format
*Vertices 27
1 "1"
2 "2"
3 "3"
4 "4"
...
*Edges 33
1 2 1
1 3 1
1 4 1
2 3 1
...
The node section must start with *Vertices N
and the
link section with *Edges N
or
*Arcs N
(case insensitive), where N
is the
number of nodes or links. The characters within each quotation mark
defines the corresponding node name. Weights can be given to nodes
by adding a third column with positive numbers, and the list of
links are treated the same as for a
link list.
The 3-gram format contains information about second-order dynamics
and is similar to the pajek format but
the links are defined by at least three columns,
from through to [weight]
:
# A network in 3-gram format
*Vertices 27
1 "1"
2 "2"
3 "3"
4 "4"
...
*3grams 33
1 1 2 1
2 1 2 1
3 1 2 1
1 2 1 1
...
The node section format follows exactly the Pajek format, but the
link section must begin with *3grams N
or
*Arcs N
(case insensitive), where N
is the
number of links. Because pathways are inherently directed, the links
are assumed to be directed by default.
You can mix the trigrams with bigrams, i.e. ordinary links, by
setting the first column to -1
. Infomap will then try
to match those links with the existing trigrams to spread out the
weight proportinally to the weigh of the matched links.
There are three ways to match an incomplete link
-1 2 3
, in priority order:
x 2 3
-> x 2 3
x 2 y
-> x 2 3
x y 2
-> y 2 3
If no matches are found, a self-link 2 2 3
is created
if self-links are allowed.
In a multiplex network, each physical node can exist in a number of layers, with different link structure for each layer. A general multiplex format follows the Pajek layout, but with the links defined between nodes for each layer:
# A network in a general multiplex format
*Vertices 4
1 "Node 1"
2 "Node 2"
3 "Node 3"
4 "Node 4"
*Multiplex
# layer node layer node [weight]
1 1 1 2 2
1 1 2 2 1
1 2 1 1 1
1 3 2 2 1
2 2 1 3 1
2 3 2 2 1
2 4 2 1 2
2 4 1 2 1
The weight
column is optional. Links without the last
column will get weight 1.0
by default.
If no heading (beginning with *
) is given, the
multiplex format assumes *Multiplex
link format:
# A network in a general multiplex format
# layer node layer node [weight]
1 1 1 2 2
1 1 2 2 1
1 2 1 1 1
1 3 2 2 1
2 2 1 3 1
2 3 2 2 1
2 4 2 1 2
2 4 1 2 1
The multiplex format above gives full control of the dynamics, and
no other movements are encoded. However, it is often useful to
consider a dynamics in which a random walker moves within a layer
and with a given relax rate jumps to another layer without recording
this movement, such that the constraints from moving in different
layers can be gradually relaxed. The format below explicitely
divides the links into two groups, links within layers
(intra-layer links) and links between layers (inter-layer
links). For the first group, the second layer column can be omitted.
For the second group, the second node can be omitted, as a shorthand
for an unrecorded jump between layers. That is, each
inter-layer link layer1 node1 layer2
is expanded to
weighted multiplex links layer1 node1 layer2 node2
, one
for each node2
that node1
is connected to
in layer2
with weight proportional to the weight of the
link between node1
and node2
in
layer2
. In this way, the random walker seamlessly
switches to a different layer at a rate proportional to the
inter-layer link weight to that layer, and the encoded dynamics
correspond to relaxed layer constraints (see the
interactive storyboard
for illustration). To define links like this, use the
*Intra
and *Inter
headings:
# A network in multiplex format
*Intra
# layer node node weight
1 1 2 1
1 2 1 1
1 2 3 1
1 3 2 1
1 3 1 1
1 1 3 1
1 2 4 1
1 4 2 1
2 4 5 1
2 5 4 1
2 5 6 1
2 6 5 1
2 6 4 1
2 4 6 1
2 3 6 1
2 6 3 1
*Inter
# layer node layer weight
1 3 2
2 3 1
1 4 2
2 4 1
If no inter links are provided, the inter links will be generated from the intra link structure by relaxing the layer constraints on those links.
The bipartite network format uses prefixes f
and
n
for features and nodes, respectively. The bipartite
network can be provided both with node names:
# A bipartite network with node names
*Vertices 5
1 "Node 1"
2 "Node 2"
3 "Node 3"
4 "Node 4"
5 "Node 5"
*Edges 6
# feature node [weight]
f1 n1 2
f1 n2 2
f1 n3 1
f2 n3 1
f2 n4 2
f2 n5 2
and in the link list format:
# A bipartite network in link list format
# feature node [weight]
f1 n1 2
f1 n2 2
f1 n3 1
f2 n3 1
f2 n4 2
f2 n5 2
Node names can be provided under f
and
n
for features and nodes, respectively. In the link
list format, for example, it takes the form:
The state format describes the exact network used internally by Infomap. It can model both ordinary networks and memory networks of variable order as illustrated in this notebook.
# A network in state format
*Vertices 4
1 "PRE"
2 "SCIENCE"
3 "PRL"
4 "BIO"
*States
1 2 "1 2"
2 3 "2 3"
3 2 "4 2"
4 4 "2 4"
*Links
1 2
3 4
The *Vertices
section is optional and follows the
Pajek format. It is used to name the
physical nodes. The *States
section describes all
internal nodes with the format
uniqueId physicalId [name]
, where name
is
optional. The unique index is referenced in the
*Links
section, and the physical index of the node is
optionally referenced in the *Vertices
section. The
*Links
section describes all links as
from to [weight]
, where from
and
to
references the first index in the
*States
section.
The example state network above is equivalent to the 3-gram network below:
# A 3-gram network equivalent to the state network above
*Vertices 4
1 "PRE"
2 "SCIENCE"
3 "PRL"
4 "BIO"
*3grams
1 2 3
4 2 4
The resulting hierarchy will be written to a file with the extension
.tree
(plain text file) and corresponds to the best
hierarchical partition (shortest description length) of the
attempts. The output format has the pattern:
# Codelength = 3.48419 bits.
1:1:1 0.0384615 "7" 6
1:1:2 0.0384615 "8" 7
1:1:3 0.0384615 "9" 8
1:2:1 0.0384615 "4" 3
1:2:2 0.0384615 "5" 4
...
Each row (except the first one, which summarizes the result) begins with the multilevel module assignments of a node. The module assignments are colon separated from coarse to fine level, and all modules within each level are sorted by the total flow (PageRank) of the nodes they contain. Further, the integer after the last colon is the rank within the finest-level module, the decimal number is the amount of flow in that node, i.e. the steady state population of random walkers, the content within quotation marks is the node name, and finally, the last integer is the index of the node in the original network file.
The Tree format with an appended section
of links with the flow between nodes within the same parent. The
file is saved with the extension .ftree
(plain text
file), and begins with the node hierarchy formatted as the
Tree format, followed by the links
formatted as:
*Links undirected
#*Links path exitFlow numEdges numChildren
*Links root 0.0 6 5
1 2 0.0128205
1 3 0.0128205
2 4 0.0128205
3 4 0.0128205
3 5 0.0128205
4 5 0.0128205
*Links 1 0.025641 3 3
1 2 0.0128205
1 3 0.0128205
2 3 0.0128205
*Links 1:1 0.0384615 3 3
1 2 0.0128205
1 3 0.0128205
2 3 0.0128205
*Links 1:2 0.0384615 3 3
1 2 0.0128205
1 3 0.0128205
2 3 0.0128205
...
First is a line stating undirected
or
directed
links. Then each module is identified by the
path
field, followed by all links between child nodes
of that module. The path
is a colon separated path in
the tree from the root
to finest-level modules. The
links are sorted by flow within each module. Links exiting the
module are not included, but the flow on those links is aggregated
to exitFlow
. The number of edges and child nodes within
each module is also included in the module header line, as defined
in the commented line above. (Lines starting with #
are
treated as comments and are ignored when parsed.)
A .map
file (plain text file) describes the best
two-level partition (shortest description length) of the
attempts. The format has the pattern:
# modules: 4
# modulelinks: 4
# nodes: 16
# links: 20
# codelength: 3.32773
*Directed
*Modules 4
1 "Node 13,..." 0.25 0.0395432
2 "Node 5,..." 0.25 0.0395432
3 "Node 9,..." 0.25 0.0395432
4 "Node 1,..." 0.25 0.0395432
*Nodes 16
1:1 "Node 13" 0.0820133
1:2 "Node 14" 0.0790863
1:3 "Node 16" 0.0459137
1:4 "Node 15" 0.0429867
2:1 "Node 5" 0.0820133
2:2 "Node 6" 0.0790863
2:3 "Node 8" 0.0459137
2:4 "Node 7" 0.0429867
3:1 "Node 9" 0.0820133
3:2 "Node 10" 0.0790863
3:3 "Node 12" 0.0459137
3:4 "Node 11" 0.0429867
4:1 "Node 1" 0.0820133
4:2 "Node 2" 0.0790863
4:3 "Node 4" 0.0459137
4:4 "Node 3" 0.0429867
*Links 4
1 4 0.0395432
2 3 0.0395432
3 1 0.0395432
4 2 0.0395432
...
This file contains the necessary information to generate a visual
map in the Map Generator. The
names under *Modules
are derived from the node with the
highest flow volume within the module and
0.25 0.0395432
represent, respectively, the aggregated
flow volume of all nodes within the module and the per step exit
flow from the module. The nodes are listed with their module
assignments together with their flow volumes. Finally, all links
between the modules are listed in order from high flow to low flow.
A .clu
file (plain text file) describes the best
two-level partition (shortest description length) of the
attempts. The format has the pattern:
# node cluster flow:
4 1 0.4
3 1 0.1
1 2 0.2
2 2 0.1
5 2 0.05
8 2 0.05
7 3 0.05
6 3 0.05
If the .clu
file is used as an input clustering to
Infomap or the apps, the
flow
column will not be used and may be omitted. Even
the node
column may be omitted, in which case it
assumes an implicit column of naturally ordered integers starting
from one.
The resulting hierarchy can be written to a compact binary format
using the --btree
or --bftree
flag. The
former includes only the tree, while the latter also includes the
horizontal flow links between nodes on each level of the
tree. The data is written in a breath-first pattern to be
streamable, which the
Network Navigator utilizes
to be able to show the top modular structure of the network before
the deeper parts of the tree are loaded.
Don't hesitate to contact us if you have questions or comments about the code.
Infomap optimizes the map equation, which exploits the information-theoretic duality between the problem of compressing data, and the problem of detecting and extracting significant patterns or structures within those data.
Specifically, the map equation is a flow-based method and operates on dynamics on the network.
Infomap handles both unweighted and weighted, undirected and directed links.
Infomap clusters tightly interconnected nodes into modules (two-level clustering) or the optimal number of nested modules (multi-level clustering).
Infomap captures flow patterns modeled with both first-order dynamics (as on a conventional network: where flow moves to on the network only depends on where it currently is) and second-order dynamics (where flow moves to on the network both depends on where it currently is and where it just came from). Infomap captures second-order dynamics by performing first-order dynamics on memory nodes, see Memory in network flows and its effects on spreading dynamics and community detection and this interactive storyboard.
Infomap can identify overlapping modules by first representing the dynamics with memory nodes and then clustering the memory nodes in hard partitions.).
Infomap can identify (overlapping) modules in multilayer (multiplex) networks that may not be identified in a single aggregated network or by analyzing the layers separately. See Identifying modular flows on multilayer networks reveals highly overlapping organization in interconnected systems and this interactive storyboard.
The speed and accuracy of Infomap is visualized below, compared to the Louvain method, on generated benchmark networks described by Andrea Lancichinetti et al.
The hierarchical map equation measures the per-step average code length necessary to describe a random walker's movements on a network, given a hierarchical network partition, but the challenge is to find the partition that minimizes the description length. Into how many hierarchical levels should a given network be partitioned? How many modules should each level have? And which nodes should be members of which modules?
Below we describe the Infomap algorithm and start with a description of the original two-level clustering of nodes into modules
The core of the algorithm follows closely the Louvain method: neighboring nodes are joined into modules, which subsequently are joined into supermodules and so on. First, each node is assigned to its own module. Then, in random sequential order, each node is moved to the neighboring module that results in the largest decrease of the map equation. If no move results in a decrease of the map equation, the node stays in its original module. This procedure is repeated, each time in a new random sequential order, until no move generates a decrease of the map equation. Now the network is rebuilt, with the modules of the last level forming the nodes at this level, and, exactly as at the previous level, the nodes are joined into modules. This hierarchical rebuilding of the network is repeated until the map equation cannot be reduced further.
With this algorithm, a fairly good clustering of the network can be found in a very short time. Let us call this the core algorithm and see how it can be improved. The nodes assigned to the same module are forced to move jointly when the network is rebuilt. As a result, what was an optimal move early in the algorithm might have the opposite effect later in the algorithm. Because two or more modules that merge together and form one single module when the network is rebuilt can never be separated again in this algorithm, the accuracy can be improved by breaking the modules of the final state of the core algorithm in either of the two following ways:
In practice, we repeat the two extensions to the core algorithm in sequence and as long as the clustering is improved. Moreover, we apply the submodule movements recursively. That is, to find the submodules to be moved, the algorithm first splits the submodules into subsubmodules, subsubsubmodules, and so on until no further splits are possible. Finally, because the algorithm is stochastic and fast, we can restart the algorithm from scratch every time the clustering cannot be improved further and the algorithm stops. The implementation is straightforward and, by repeating the search more than once, 100 times or more if possible, the final partition is less likely to correspond to a local minimum. For each iteration, we record the clustering if the description length is shorter than the previously shortest description length.
We have generalized our search algorithm for the two-level map equation to recursively search for multilevel solutions. The recursive search operates on a module at any level; this can be all the nodes in the entire network, or a few nodes at the finest level. For a given module, the algorithm first generates submodules if this gives a shorter description length. If not, the recursive search does not go further down this branch. But if adding submodules gives a shorter description length, the algorithm tests if movements within the module can be further compressed by additional index codebooks. Further compression can be achieved both by adding one or more coarser codebooks to compress movements between submodules or by adding one or more finer index codebooks to compress movements within submodules. To test for all combinations, the algorithm calls itself recursively, both operating on the network formed by the submodules and on the networks formed by the nodes within every submodule. In this way, the algorithm successively increases and decreases the depth of different branches of the multilevel code structure in its search for the optimal hierarchical partitioning. For every split of a module into submodules, we use the two-level search algorithm described above.