Create embeddings directly from an adjacency matrix (e.g. numpy.array or scipy.sparse)? #92

dhimmel · 2020-04-03T15:51:10Z

I'm looking to create embeddings for the STRING protein-protein interaction database. In this notebook, I convert this network to scipy.sparse.csr_matrix and numpy.array formats. The network is big with 19,566 nodes and 11,759,454 (divide that in two for bidirectional edges).

It's easy to convert the adjacency matrix to network, e.g. with networkx.from_scipy_sparse_matrix. But this is time consuming and the resulting networkx graph takes up a lot of memory. So my question is whether there are any embedding methods that can directly take an adjacency matrix? If so, perhaps that would be computationally be more efficient than going through networkx.

Also any advice on what sort of hardware would be needed to operate on a graph of this size? Which methods are most efficient in terms of memory (and/or compute)? Thanks!

The text was updated successfully, but these errors were encountered:

palash1992 · 2020-04-07T19:18:14Z

Each of the methods has a edge_f attribute which can be used to load the graph from a file containing edges, one edge per line. That is actually pretty fast and should work for your case.

dhimmel · 2020-04-14T00:45:06Z

It looks like .learn_embedding methods with edge_f provided call graph_util.loadGraphFromEdgeListTxt:

GEM/gem/utils/graph_util.py

Lines 145 to 160 in 79be7fd

    
           def loadGraphFromEdgeListTxt(file_name, directed=True): 
        
               with open(file_name, 'r') as f: 
        
                   # n_nodes = f.readline() 
        
                   # f.readline() # Discard the number of edges 
        
                   if directed: 
        
                       G = nx.DiGraph() 
        
                   else: 
        
                       G = nx.Graph() 
        
                   for line in f: 
        
                       edge = line.strip().split() 
        
                       if len(edge) == 3: 
        
                           w = float(edge[2]) 
        
                       else: 
        
                           w = 1.0 
        
                       G.add_edge(int(edge[0]), int(edge[1]), weight=w) 
        
               return G

This function constructs the complete network in-memory as a networkx object. Creating a networkx graph is the inefficient step that is slow and memory-intensive with the STRING network.

For example, node2vec.learn_embedding ends up reading edge_f to a networkx graph, then writing it to an edgelist. From the code below, it doesn't seem the networkx graph is actually used for anything besides rewriting the edgelist file:

GEM/gem/embedding/node2vec.py

Lines 58 to 65 in 5da6632

    
           def learn_embedding(self, graph=None, edge_f=None, 
        
                               is_weighted=False, no_python=False): 
        
               args = ["node2vec"] 
        
               if not graph and not edge_f: 
        
                   raise Exception('graph/edge_f needed') 
        
               if edge_f: 
        
                   graph = graph_util.loadGraphFromEdgeListTxt(edge_f) 
        
               graph_util.saveGraphToEdgeListTxtn2v(graph, 'tempGraph.graph')

For lle.learn_embedding, it looks like a scipy.sparse matrix is actually what is required for computation, but that a networkx graph is loaded from the edgelist file to create the sparse matrix:

GEM/gem/embedding/lle.py

Lines 52 to 68 in 5da6632

    
           def learn_embedding(self, graph=None, edge_f=None, 
        
                               is_weighted=False, no_python=False): 
        
               if not graph and not edge_f: 
        
                   raise Exception('graph/edge_f needed') 
        
               if not graph: 
        
                   graph = graph_util.loadGraphFromEdgeListTxt(edge_f) 
        
               graph = graph.to_undirected() 
        
               t1 = time() 
        
               A = nx.to_scipy_sparse_matrix(graph) 
        
               normalize(A, norm='l1', axis=1, copy=False) 
        
               I_n = sp.eye(len(graph.nodes)) 
        
               I_min_A = I_n - A 
        
               u, s, vt = lg.svds(I_min_A, k=self._d + 1, which='SM') 
        
               t2 = time() 
        
               self._X = vt.T 
        
               self._X = self._X[:, 1:] 
        
               return self._X.real, (t2 - t1)

For sdne.learn_embedding, a sparse matrix is also used:

GEM/gem/embedding/sdne.py

Lines 74 to 80 in 5da6632

    
           def learn_embedding(self, graph=None, edge_f=None, 
        
                               is_weighted=False, no_python=False): 
        
               if not graph and not edge_f: 
        
                   raise Exception('graph/edge_f needed') 
        
               if not graph: 
        
                   graph = graph_util.loadGraphFromEdgeListTxt(edge_f) 
        
               S = nx.to_scipy_sparse_matrix(graph)

I'd love access the unified API of GEM, but with support for providing an edgelist file or scipy.sparse matrix directly as input, without having this converted to a networkx graph. It seems that several of the methods actually use more efficient data structures than the networkx graph and don't need networkx features at all.

Do you have any suggestions on whether this would be possible? Even if just for one method? Currently, we've used node2vec without GEM to side-step this problem (notebook), but would love the ability to use the implementations in this package!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create embeddings directly from an adjacency matrix (e.g. numpy.array or scipy.sparse)? #92

Create embeddings directly from an adjacency matrix (e.g. numpy.array or scipy.sparse)? #92

dhimmel commented Apr 3, 2020

palash1992 commented Apr 7, 2020

dhimmel commented Apr 14, 2020 •

edited

Create embeddings directly from an adjacency matrix (e.g. numpy.array or scipy.sparse)? #92

Create embeddings directly from an adjacency matrix (e.g. numpy.array or scipy.sparse)? #92

Comments

dhimmel commented Apr 3, 2020

palash1992 commented Apr 7, 2020

dhimmel commented Apr 14, 2020 • edited

dhimmel commented Apr 14, 2020 •

edited