rnaglib.kernels

Node Similarity

Functions for comparing node similarity.

class rnaglib.kernels.node_sim.SimFunctionNode(method, depth, decay=0.5, idf=False, normalization=None, hash_init_path='/home/docs/checkouts/readthedocs.org/user_builds/rnaglib/checkouts/latest/rnaglib/utils/../data/hashing/NR_chops_hash.p')[source]

Bases: object

Factory object to compute all node similarities. These methods take as input an annotated pair of nodes and compare them.

These methods are detailed in the supplemental of the paper, but include five methods. These methods frequently rely on the hungarian algorithm, an algorithm that finds optimal matches according to a cost function.

Three of them compare the edges :

  • R_1 compares the histograms of each ring, possibly with an idf weighting (to emphasize differences in rare edges)

  • R_iso compares each ring with the best matching based on the isostericity values

  • hungarian compares the whole annotation, with the rings being differentiated with an additional ‘depth’ field.

Then all the nodes are compared based on isostericity and this depth field.

Two of them compare the graphlets. The underlying idea is that just comparing lists of edges does not constraint the graph structure, while the assembling of graphlet does it more (exceptions can be found but for most graphs, knowing all graphlets at each depth enables recreating the graph) :

  • R_graphlets works like R_iso except that the isostericity is replaced by the GED

  • graphlet works like the hungarian except that the isostericity is replaced by the GED

Parameters:
  • method – a string that identifies which of these method to use

  • depth – The depth to use in the annotations rings

  • decay – When using rings comparison function, the weight decay of importance based on the depth (the

closest rings weigh more as they are central to the comparison) :param idf: Whether to use IDF weighting on the frequency of labels. :param normalization: We experiment with three normalization scheme, the basal one is just a division of the score by the maximum value, ‘sqrt’ denotes using the square root of the ratio as a power of the raw value and ‘log’ uses the log. The underlying idea is that we want to put more emphasis on the long matches than on just matching a few nodes :param hash_init_path: For the graphlets comparisons, we need to supply a hashing path to be able to store the values of ged and reuse them based on the hash.

add_hashtable(hash_init_path)[source]
Parameters:

hash_init_path – A string with the full path to a pickled hashtable

Returns:

None, modify self.

compare(rings1, rings2)[source]

Compares two nodes represented as their rings.

The edge list for the first hop (centered around a node) is None, so it gets skipped, when we say depth=3, we want rings[1:4], hence range(1, depth+1) Need to take this into account for normalization

(see class constructor)

param rings1:

A list of rings at each depth. Rings contain a list of node, edge or graphlet information at a

given distance from a central node. :param rings2: Same as above for another node. :return: Normalized comparison score between the nodes

normalize(unnormalized, max_score)[source]

We want our normalization to be more lenient to longer matches

Parameters:
  • unnormalized – a score in [0, max_score]

  • max_score – the best possible matching score of the sequences we are given

Returns:

a score in [0,1]

get_length(ring1, ring2, graphlets=False)[source]

This is meant to return an adapted ‘length’ that represents the optimal score obtained when matching all the elements in the two rings at hands

param rings1:

A list of rings at each depth. Rings contain a list of node, edge or graphlet information at a

given distance from a central node. :param rings2: Same as above for another node. :param graphlets: Whether we use graphlets instead of edges. Then no isostericity can be used to compute length

Returns:

a float that represents the score of a perfect match

static delta_indices_sim(i, j, distance=False)[source]

We need a scoring related to matching different depth nodes. Returns a positive score in [0,1]

Parameters:
  • i – pos of the first node

  • j – pos of the second node

Returns:

A normalized value of their distance (exp(abs(i-j))

get_cost_nodes(node_i, node_j, bb=False, pos=False)[source]

Compare two nodes and returns a cost.

Returns a positive number that has to be negated for minimization

:param node_i : This either just contains a label to be compared with isostericity, or a tuple that also includes distance from the root node :param node_j : Same as above :param bb : Check if what is being compared is backbone (no isostericity then) :param pos: if pos is true, nodes are expected to be (edge label, distance from root) else just a edge label. pos is True when used from a comparison between nodes from different rings :return: the cost of matching those two nodes

R_1(ring1, ring2)[source]

Compute R_1 function over lists of features by counting intersect and normalise by the number

Parameters:
  • ring1 – list of features

  • ring2 – Same as above for other node

Returns:

Score

R_iso(list1, list2)[source]

Compute R_iso function over lists of features by matching each ring with the hungarian algorithm on the iso values

We do a separate computation for backbone.

Parameters:
  • list1 – list of features

  • list2

    ’’

Returns:

Score

hungarian(rings1, rings2)[source]

Compute hungarian function over lists of features by adding a depth field into each ring (based on its index in rings). Then we try to match all nodes together, to deal with bulges for instances.

We do a separate computation for backbone.

Parameters:
  • list1 – list of features

  • list2

    ’’

Returns:

Score

graphlet_cost_nodes(node_i, node_j, pos=False, similarity=False)[source]
Returns a node distance between nodes represented as graphlets

Compare two nodes and returns a cost.

Returns a positive number that has to be negated for minimization

:param node_i : This either just contains a label to be compared with isostericity, or a tuple that also includes distance from the root node :param node_j : Same as above :param pos: if pos is true, nodes are expected to be (graphlet, distance from root) else just a graphlet. pos is True when used from a comparison between nodes from different rings :return: the cost of matching those two nodes

R_graphlets(ring1, ring2)[source]

Compute R_graphlets function over lists of features.

Parameters:
  • ring1 – list of list of graphlets

  • ring2 – Same as above for other node

Returns:

Score

graphlet(rings1, rings2)[source]

This function performs an operation similar to the hungarian algorithm using ged between graphlets instead of isostericity.

We also add a distance to root node attribute to each graphlet and then match them optimally

Parameters:
  • ring1 – list of list of graphlets

  • ring2 – Same as above for other node

Returns:

Score

rnaglib.kernels.node_sim.graph_edge_freqs(graphs, stop=False)[source]

Get IDF for each edge label over whole dataset. First get a total frequency dictionnary :{‘CWW’: 110, ‘TWW’: 23} Then compute IDF and returns the value.

Parameters:
  • graphs – The graphs over which to compute the frequencies, a list of nx graphs

  • stop – Set to True for just doing it on a subset

Returns:

A dict with the idf values.

rnaglib.kernels.node_sim.pdist_list(rings, node_sim)[source]

Defines the block creation using a list of rings at the graph level (should also ultimately include trees) Creates a SIMILARITY matrix.

Parameters:
  • rings – a list of rings, dictionnaries {node : (nodelist, edgelist)}

  • node_sim – the pairwise node comparison function

Returns:

the upper triangle of a similarity matrix, in the form of a list

rnaglib.kernels.node_sim.k_block_list(rings, node_sim)[source]

Defines the block creation using a list of rings at the graph level (should also ultimately include trees) Creates a SIMILARITY matrix.

Parameters:
  • rings – a list of rings, dictionnaries {node : (nodelist, edgelist)}

  • node_sim – the pairwise node comparison function

Returns:

A whole similarity matrix in the form of a numpy array that follows the order of rings

rnaglib.kernels.node_sim.simfunc_time(simfuncs, graph_path, batches=1, batch_size=5, names=None)[source]

Do time benchmark on a list of simfunc.

Parameters:
  • simfuncs

  • graph_path

  • batches

  • batch_size

  • names

Returns: