rnaglib.data_loading

class rnaglib.data_loading.RNADataset(data_path=None, version='1.0.0', download_dir=None, redundancy='nr', all_graphs=None, representations=(), rna_features=None, nt_features=None, bp_features=None, rna_targets=None, nt_targets=None, bp_targets=None, annotated=False, verbose=False)[source]

This class is the main object holding the core RNA data annotations. The RNADataset.all_rnas attribute is a generator of networkx objects that hold all the annotations for each RNA in the dataset. You can also access individual RNAs on disk with RNADataset()[idx] or RNADataset().get_pdbid('1b23')

Parameters:
  • representations – List of rnaglib.Representation objects to apply to each item.

  • data_path – The path to the folder containing the graphs. If node_sim is not None, this data should be annotated

  • version – Version of the dataset to use (default='1.0.0')

  • redundancy – Whether to use all graphs or just the non-redundant set

  • all_graphs – An optional list of graph names to use within the given directory
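The access pattern described above can be sketched with a minimal stand-in class (MockRNADataset is hypothetical and stdlib-only; the real RNADataset loads annotated networkx graphs from disk):

```python
# Minimal sketch of the RNADataset access pattern: integer indexing
# and lookup by PDB id. The class and data below are illustrative.
class MockRNADataset:
    def __init__(self, rnas):
        # rnas: mapping of PDB id -> annotation object
        # (in rnaglib, each value is a networkx graph)
        self._ids = list(rnas)
        self._rnas = rnas

    def __len__(self):
        return len(self._ids)

    def __getitem__(self, idx):
        # mirrors RNADataset()[idx]
        return self._rnas[self._ids[idx]]

    def get_pdbid(self, pdbid):
        # mirrors RNADataset().get_pdbid('1b23')
        return self._rnas[pdbid.lower()]

dataset = MockRNADataset({"1b23": {"pdbid": "1b23"},
                          "4nlf": {"pdbid": "4nlf"}})
print(len(dataset))                         # 2
print(dataset[0]["pdbid"])                  # 1b23
print(dataset.get_pdbid("1B23")["pdbid"])   # 1b23
```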

subset(list_of_graphs)[source]

Create another dataset with only the specified graphs

Parameters:

list_of_graphs – a list of graph names

Returns:

A new RNADataset restricted to the given graphs
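The intended semantics can be sketched as a simple filter over the dataset's graph files (toy file names; assumed behaviour based on the description above):

```python
# Sketch of subset(): keep only the named graphs, in dataset order.
all_graphs = ["1b23.json", "4nlf.json", "5swe.json"]
list_of_graphs = ["1b23.json", "5swe.json"]

subset_graphs = [g for g in all_graphs if g in set(list_of_graphs)]
print(subset_graphs)  # ['1b23.json', '5swe.json']
```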

get_pdbid(pdbid)[source]

Grab an RNA by its pdbid

get_nt_encoding(g, encode_feature=True)[source]

Get targets for graph g. For every node, get the attribute specified by self.node_target and output a mapping of nodes to their targets.

Parameters:
  • g – a nx graph

  • encode_feature – A boolean controlling whether to encode the features or the targets

Returns:

A dict that maps nodes to encodings
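A node-to-encoding mapping of this kind can be sketched with a one-hot encoding over the four standard nucleotides (the toy graph and attribute names below are illustrative; rnaglib stores annotations on networkx graph nodes):

```python
# Sketch of a node -> encoding dict like get_nt_encoding's output,
# assuming a one-hot encoding of the nucleotide identity.
NT_VOCAB = ["A", "C", "G", "U"]

def one_hot_nt(nt):
    vec = [0.0] * len(NT_VOCAB)
    vec[NT_VOCAB.index(nt)] = 1.0
    return vec

# Toy graph as {node_id: attributes}.
g = {"1b23.A.5": {"nt_code": "G"},
     "1b23.A.6": {"nt_code": "C"}}

encodings = {node: one_hot_nt(attrs["nt_code"]) for node, attrs in g.items()}
print(encodings["1b23.A.5"])  # [0.0, 0.0, 1.0, 0.0]
```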

compute_dim(node_parser)[source]

Based on the encoding scheme, we can compute the shapes of the input and output tensors

Returns:

The dimensions of the input and output tensors

compute_features(rna_dict)[source]

Add three dictionaries to rna_dict, which map nucleotides, edges, and the whole graph to a feature vector each. The final converter uses these to include the data in the framework-specific object.
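The shape of the result can be sketched as follows (the key names and vectors are illustrative placeholders, not rnaglib's actual keys):

```python
# Sketch of the three dictionaries compute_features adds to rna_dict:
# one map per granularity (nucleotide, edge, whole graph).
rna_dict = {"rna_name": "1b23"}

rna_dict["nt_features"] = {"1b23.A.5": [1.0, 0.0],
                           "1b23.A.6": [0.0, 1.0]}              # per nucleotide
rna_dict["edge_features"] = {("1b23.A.5", "1b23.A.6"): [1.0]}   # per edge
rna_dict["graph_features"] = {"1b23": [0.5]}                    # whole graph

# A framework-specific converter can now turn these maps into tensors.
print(sorted(k for k in rna_dict if k.endswith("_features")))
# ['edge_features', 'graph_features', 'nt_features']
```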

rnaglib.data_loading.get_loader(dataset, batch_size=5, num_workers=0, split=True, split_train=0.7, split_valid=0.85, verbose=False, framework='dgl')[source]

Fetch a loader object for a given dataset.

Parameters:
  • dataset (rnaglib.data_loading.RNADataset) – Dataset for loading.

  • batch_size (int) – number of items in batch

  • split (bool) – whether to compute splits

  • split_train (float) – cumulative fraction of the dataset kept for training

  • split_valid (float) – cumulative cut point ending the validation split (the validation set is the slice between split_train and split_valid)

  • verbose (bool) – print updates

  • framework (str) – learning framework to use (‘dgl’)

Returns:

torch.utils.data.DataLoader
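Assuming the split fractions are cumulative cut points, as the defaults (0.7, 0.85) suggest, the index carving can be sketched as:

```python
# Sketch of how cumulative split fractions partition dataset indices
# into train / validation / test slices.
n = 100  # toy dataset size
split_train, split_valid = 0.7, 0.85

train_end = int(split_train * n)   # 70
valid_end = int(split_valid * n)   # 85

train_idx = list(range(0, train_end))
valid_idx = list(range(train_end, valid_end))
test_idx = list(range(valid_end, n))
print(len(train_idx), len(valid_idx), len(test_idx))  # 70 15 15
```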

class rnaglib.data_loading.Collater(dataset)[source]

Wrapper around the collate function, so we can use different node similarities. We cannot use functools.partial because it is not picklable and therefore incompatible with PyTorch data loading

Initialize a Collater object.

Parameters:
  • node_simfunc – A node comparison function as defined in kernels, to optionally return a pairwise comparison of the nodes in the batch

  • max_size_kernel – If the node comparison is not None, optionally only return a pairwise comparison between a subset of all nodes, of size max_size_kernel

  • hstack – If True, hstack the point cloud return

Returns:

a picklable Python callable that can be invoked on a batch by PyTorch loaders

collate(samples)[source]

Collate a batch by iterating through the possible keys returned by get_item

The graphs are batched, the rings are compared with self.node_simfunc, and the features are put into a list.

Parameters:

samples – a batch of items drawn from the dataset

Returns:

a dict
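The batching behaviour can be sketched with a minimal callable class (the key names and gathering logic are illustrative; the real collate also batches graphs framework-specifically and compares rings with the node similarity function):

```python
# Sketch of the Collater pattern: a top-level callable class is picklable,
# unlike a closure or lambda, so it works with multi-worker PyTorch loaders.
class Collater:
    def __init__(self, dataset):
        self.dataset = dataset

    def __call__(self, samples):
        # Gather each key across the batch into a list, yielding a dict
        # of lists (placeholder for the real graph-batching logic).
        batch = {}
        for sample in samples:
            for key, value in sample.items():
                batch.setdefault(key, []).append(value)
        return batch

collater = Collater(dataset=None)
batch = collater([{"graph": "g1", "features": [1.0]},
                  {"graph": "g2", "features": [2.0]}])
print(batch["graph"])     # ['g1', 'g2']
print(batch["features"])  # [[1.0], [2.0]]
```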