rnaglib.kernels
Node Similarity
Functions for comparing node similarity.
- class rnaglib.kernels.node_sim.SimFunctionNode(method, depth, decay=0.5, idf=False, normalization=None, hash_init_path='/home/docs/checkouts/readthedocs.org/user_builds/rnaglib/checkouts/latest/rnaglib/utils/../data/hashing/NR_chops_hash.p')[source]
Bases:
object
Factory object to compute all node similarities. These methods take as input an annotated pair of nodes and compare them.
These methods are detailed in the supplemental of the paper, but include five methods. These methods frequently rely on the hungarian algorithm, an algorithm that finds optimal matches according to a cost function.
Three of them compare the edges :
R_1 compares the histograms of each ring, possibly with an idf weighting (to emphasize differences in rare edges)
R_iso compares each ring with the best matching based on the isostericity values
hungarian compares the whole annotation, with the rings being differentiated with an additional ‘depth’ field.
Then all the nodes are compared based on isostericity and this depth field.
Two of them compare the graphlets. The underlying idea is that just comparing lists of edges does not constraint the graph structure, while the assembling of graphlet does it more (exceptions can be found but for most graphs, knowing all graphlets at each depth enables recreating the graph) :
R_graphlets works like R_iso except that the isostericity is replaced by the GED
graphlet works like the hungarian except that the isostericity is replaced by the GED
- Parameters:
method – a string that identifies which of these method to use
depth – The depth to use in the annotations rings
decay – When using rings comparison function, the weight decay of importance based on the depth (the
closest rings weigh more as they are central to the comparison) :param idf: Whether to use IDF weighting on the frequency of labels. :param normalization: We experiment with three normalization scheme, the basal one is just a division of the score by the maximum value, ‘sqrt’ denotes using the square root of the ratio as a power of the raw value and ‘log’ uses the log. The underlying idea is that we want to put more emphasis on the long matches than on just matching a few nodes :param hash_init_path: For the graphlets comparisons, we need to supply a hashing path to be able to store the values of ged and reuse them based on the hash.
- add_hashtable(hash_init_path)[source]
- Parameters:
hash_init_path – A string with the full path to a pickled hashtable
- Returns:
None, modify self.
- compare(rings1, rings2)[source]
Compares two nodes represented as their rings.
The edge list for the first hop (centered around a node) is None, so it gets skipped, when we say depth=3, we want rings[1:4], hence range(1, depth+1) Need to take this into account for normalization
(see class constructor)
- param rings1:
A list of rings at each depth. Rings contain a list of node, edge or graphlet information at a
given distance from a central node. :param rings2: Same as above for another node. :return: Normalized comparison score between the nodes
- normalize(unnormalized, max_score)[source]
We want our normalization to be more lenient to longer matches
- Parameters:
unnormalized – a score in [0, max_score]
max_score – the best possible matching score of the sequences we are given
- Returns:
a score in [0,1]
- get_length(ring1, ring2, graphlets=False)[source]
This is meant to return an adapted ‘length’ that represents the optimal score obtained when matching all the elements in the two rings at hands
- param rings1:
A list of rings at each depth. Rings contain a list of node, edge or graphlet information at a
given distance from a central node. :param rings2: Same as above for another node. :param graphlets: Whether we use graphlets instead of edges. Then no isostericity can be used to compute length
- Returns:
a float that represents the score of a perfect match
- static delta_indices_sim(i, j, distance=False)[source]
We need a scoring related to matching different depth nodes. Returns a positive score in [0,1]
- Parameters:
i – pos of the first node
j – pos of the second node
- Returns:
A normalized value of their distance (exp(abs(i-j))
- get_cost_nodes(node_i, node_j, bb=False, pos=False)[source]
Compare two nodes and returns a cost.
Returns a positive number that has to be negated for minimization
:param node_i : This either just contains a label to be compared with isostericity, or a tuple that also includes distance from the root node :param node_j : Same as above :param bb : Check if what is being compared is backbone (no isostericity then) :param pos: if pos is true, nodes are expected to be (edge label, distance from root) else just a edge label. pos is True when used from a comparison between nodes from different rings :return: the cost of matching those two nodes
- R_1(ring1, ring2)[source]
Compute R_1 function over lists of features by counting intersect and normalise by the number
- Parameters:
ring1 – list of features
ring2 – Same as above for other node
- Returns:
Score
- R_iso(list1, list2)[source]
Compute R_iso function over lists of features by matching each ring with the hungarian algorithm on the iso values
We do a separate computation for backbone.
- Parameters:
list1 – list of features
list2 –
’’
- Returns:
Score
- hungarian(rings1, rings2)[source]
Compute hungarian function over lists of features by adding a depth field into each ring (based on its index in rings). Then we try to match all nodes together, to deal with bulges for instances.
We do a separate computation for backbone.
- Parameters:
list1 – list of features
list2 –
’’
- Returns:
Score
- graphlet_cost_nodes(node_i, node_j, pos=False, similarity=False)[source]
- Returns a node distance between nodes represented as graphlets
Compare two nodes and returns a cost.
Returns a positive number that has to be negated for minimization
:param node_i : This either just contains a label to be compared with isostericity, or a tuple that also includes distance from the root node :param node_j : Same as above :param pos: if pos is true, nodes are expected to be (graphlet, distance from root) else just a graphlet. pos is True when used from a comparison between nodes from different rings :return: the cost of matching those two nodes
- R_graphlets(ring1, ring2)[source]
Compute R_graphlets function over lists of features.
- Parameters:
ring1 – list of list of graphlets
ring2 – Same as above for other node
- Returns:
Score
- graphlet(rings1, rings2)[source]
This function performs an operation similar to the hungarian algorithm using ged between graphlets instead of isostericity.
We also add a distance to root node attribute to each graphlet and then match them optimally
- Parameters:
ring1 – list of list of graphlets
ring2 – Same as above for other node
- Returns:
Score
- rnaglib.kernels.node_sim.graph_edge_freqs(graphs, stop=False)[source]
Get IDF for each edge label over whole dataset. First get a total frequency dictionnary :{‘CWW’: 110, ‘TWW’: 23} Then compute IDF and returns the value.
- Parameters:
graphs – The graphs over which to compute the frequencies, a list of nx graphs
stop – Set to True for just doing it on a subset
- Returns:
A dict with the idf values.
- rnaglib.kernels.node_sim.pdist_list(rings, node_sim)[source]
Defines the block creation using a list of rings at the graph level (should also ultimately include trees) Creates a SIMILARITY matrix.
- Parameters:
rings – a list of rings, dictionnaries {node : (nodelist, edgelist)}
node_sim – the pairwise node comparison function
- Returns:
the upper triangle of a similarity matrix, in the form of a list
- rnaglib.kernels.node_sim.k_block_list(rings, node_sim)[source]
Defines the block creation using a list of rings at the graph level (should also ultimately include trees) Creates a SIMILARITY matrix.
- Parameters:
rings – a list of rings, dictionnaries {node : (nodelist, edgelist)}
node_sim – the pairwise node comparison function
- Returns:
A whole similarity matrix in the form of a numpy array that follows the order of rings