Here we provide the source code to the Infomap algorithm for detecting communities in large networks. Infomap optimizes the map equation, which eploits the information-theoretic duality between the problem of compressing data, and the problem of detecting and extracting significant patterns or structures within those data. Infomap handles links that are: unweighted, weighted, undirected, and directed and clusters tightly interconnected nodes into modules (two-level clustering) or the optimal number of nested clusters (multi-level clustering).
The latest Infomap code can be downloaded here (May 8, 2013): Infomap-0.11.4.zip.
Infomap is written in C++ and can be compiled with gcc using the included Makefile. To extract the zipped archive and compile the source code in a Unix-like environment, open a terminal and type this:
cd [path/to/Infomap]
unzip Infomap-0.11.4.zip
cd Infomap-0.11.4
make
Substitute [path/to/Infomap] to the folder where Infomap was downloaded, e.g. ~/Downloads.
To be able to run on Windows, you can install MinGW/MSYS to get a minimalist development environment where the above instructions will work. Please follow the instructions on MinGW - Getting Started or download the complete MinGW-MSYS Bundle.
Usage: Infomap [options] network_data dest
The optional arguments can be put anywhere. Run ./Infomap --help or check the Options section to see the available options. The network_data should point to a valid network file (see Input section) and dest to a directory where the Infomap should write its output files.
option is given, Infomap will assume an undirected network and try to partition it hierarchically.mkdir output
./Infomap ninetriangles.net output/ -N 10
Here, we first create a new directory where we want the put the results, and feed that to Infomap together with an input network and an option. Now Infomap will try to parse the file ninetriangles.net as an undirected network, partition it hierarchically and write the best result out of 10 attempts to the output directory.
./Infomap ninetriangles.net output/ -N 10 --directed --two-level
Here, Infomap will treat the network as directed and try to find the optimal two-level partition with respect to the flow. This will also create an output .map file that can be visualized using the Map Generator.
Currently, Infomap understands two different file formats for the input network data, a minimal link list format and the Pajek format. Specify the input format to Infomap with the -i or --input-format flag together with pajek or link-list as its argument. If no input format is specified, Infomap will try to parse the file by this assumption:
.net - Pajek format.txt - Link list formatA link list is a minimal format to describe a network by only specifying a set of links:
# A network in link list format
1 2 1
1 3 1
2 3 2
3 5 0.5
...
Each line corresponds to the triad source target weight which describes a weighted link between the nodes with specified numbers. The weight can be any non-negative value. If omitted, the link weight will default to 1. The first node are assumed to start from 1* and the total number of nodes will be determined by the maximum node number.
The Pajek format specifies both the nodes and the links in two different sections of the file:
# A network in Pajek format
*Vertices 27
1 "1"
2 "2"
3 "3"
4 "4"
...
*Edges 33
1 2 1
1 3 1
1 4 1
2 3 1
...
The node section must start with *Vertices N and the link section with *Edges N or *Arcs N (case insensitive), where N is the number of nodes or links. The characters within each quotation mark defines the corresponding node name. Weights can be given to nodes by adding a third column with positive numbers, and the list of links are treated the same as for a link list.
0, you have to add the flag -z or --zero-based to Infomap, otherwise it will assume that the node numbering starts from 1.
The resulting hierarchy will be written to a file with the extension .tree (plain text file) and corresponds to the best hierarchical partition (shortest description length) of the attempts. The output format has the pattern:
# Codelength = 3.48419 bits.
1:1:1 0.0384615 "7"
1:1:2 0.0384615 "8"
1:1:3 0.0384615 "9"
1:2:1 0.0384615 "4"
1:2:2 0.0384615 "5"
...
Each row (except the first one, which summarizes the result) begins with the multilevel module assignments of a node. The module assignments are colon separated from coarse to fine level, and all modules within each level are sorted by the total PageRank of the nodes they contain. Further, the integer after the last comma is the rank within the finest-level module, the decimal number is the steady state population of random walkers, and finally, within quotation marks, is the node name.
A .map file (plain text file) describes the best two-level partition (shortest description length) of the attempts. The format has the pattern:
# modules: 4
# modulelinks: 4
# nodes: 16
# links: 20
# codelength: 3.32773
*Directed
*Modules 4
1 "Node 13,..." 0.25 0.0395432
2 "Node 5,..." 0.25 0.0395432
3 "Node 9,..." 0.25 0.0395432
4 "Node 1,..." 0.25 0.0395432
*Nodes 16
1:1 "Node 13" 0.0820133
1:2 "Node 14" 0.0790863
1:3 "Node 16" 0.0459137
1:4 "Node 15" 0.0429867
2:1 "Node 5" 0.0820133
2:2 "Node 6" 0.0790863
2:3 "Node 8" 0.0459137
2:4 "Node 7" 0.0429867
3:1 "Node 9" 0.0820133
3:2 "Node 10" 0.0790863
3:3 "Node 12" 0.0459137
3:4 "Node 11" 0.0429867
4:1 "Node 1" 0.0820133
4:2 "Node 2" 0.0790863
4:3 "Node 4" 0.0459137
4:4 "Node 3" 0.0429867
*Links 4
1 4 0.0395432
2 3 0.0395432
3 1 0.0395432
4 2 0.0395432
...
This file contains the necessary information to generate a visual map in the Map Generator. The names under *Modules are derived from the node with the highest flow volume within the module and 0.25 0.0395432 represent, respectively, the aggregated flow volume of all nodes within the module and the per step exit flow from the module. The nodes are listed with their module assignments together with their flow volumes. Finally, all links between the modules are listed in order from high flow to low flow.
Don't hesitate to contact us if you have questions or comments about the code.
The hierarchical map equation measures the per-step average code length necessary to describe a random walker's movements on a network, given a hierarchical network partition, but the challenge is to find the partition that minimizes the description length. Into how many hierarchical levels should a given network be partitioned? How many modules should each level have? And which nodes should be members of which modules?
Below we describe the Infomap algorithm and start with a description of the original two-level clustering of nodes into modules
The core of the algorithm follows closely the Louvain method: neighboring nodes are joined into modules, which subsequently are joined into supermodules and so on. First, each node is assigned to its own module. Then, in random sequential order, each node is moved to the neighboring module that results in the largest decrease of the map equation. If no move results in a decrease of the map equation, the node stays in its original module. This procedure is repeated, each time in a new random sequential order, until no move generates a decrease of the map equation. Now the network is rebuilt, with the modules of the last level forming the nodes at this level, and, exactly as at the previous level, the nodes are joined into modules. This hierarchical rebuilding of the network is repeated until the map equation cannot be reduced further.
With this algorithm, a fairly good clustering of the network can be found in a very short time. Let us call this the core algorithm and see how it can be improved. The nodes assigned to the same module are forced to move jointly when the network is rebuilt. As a result, what was an optimal move early in the algorithm might have the opposite effect later in the algorithm. Because two or more modules that merge together and form one single module when the network is rebuilt can never be separated again in this algorithm, the accuracy can be improved by breaking the modules of the final state of the core algorithm in either of the two following ways:
In practice, we repeat the two extensions to the core algorithm in sequence and as long as the clustering is improved. Moreover, we apply the submodule movements recursively. That is, to find the submodules to be moved, the algorithm first splits the submodules into subsubmodules, subsubsubmodules, and so on until no further splits are possible. Finally, because the algorithm is stochastic and fast, we can restart the algorithm from scratch every time the clustering cannot be improved further and the algorithm stops. The implementation is straightforward and, by repeating the search more than once, 100 times or more if possible, the final partition is less likely to correspond to a local minimum. For each iteration, we record the clustering if the description length is shorter than the previously shortest description length.
We have generalized our search algorithm for the two-level map equation to recursively search for multilevel solutions. The recursive search operates on a module at any level; this can be all the nodes in the entire network, or a few nodes at the finest level. For a given module, the algorithm first generates submodules if this gives a shorter description length. If not, the recursive search does not go further down this branch. But if adding submodules gives a shorter description length, the algorithm tests if movements within the module can be further compressed by additional index codebooks. Further compression can be achieved both by adding one or more coarser codebooks to compress movements between submodules or by adding one or more finer index codebooks to compress movements within submodules. To test for all combinations, the algorithm calls itself recursively, both operating on the network formed by the submodules and on the networks formed by the nodes within every submodule. In this way, the algorithm successively increases and decreases the depth of different branches of the multilevel code structure in its search for the optimal hierarchical partitioning. For every split of a module into submodules, we use the two-level search algorithm described above.
The input options to Infomap is listed below:
Argument types are defined like: