conkit.core.sequencefile module

SequenceFile container used throughout ConKit

class SequenceFile(id)

A sequence file object representing a single sequence file

The SequenceFile class represents a data structure to hold Sequence instances in a single sequence file. It contains functions to store and analyze sequences.

id

str – A unique identifier

is_alignment

bool – A boolean status for the alignment

meff

int – The number of effective sequences in the SequenceFile

nseq

int – The number of sequences in the SequenceFile

remark

list – The SequenceFile-specific remarks

status

int – An indication of the sequence file status, i.e. alignment, no alignment, or unknown

top_sequence

Sequence, None – The first Sequence entry in the file

Examples

>>> from conkit.core import Sequence, SequenceFile
>>> sequence_file = SequenceFile("example")
>>> sequence_file.add(Sequence("foo", "ABCDEF"))
>>> sequence_file.add(Sequence("bar", "ZYXWVU"))
>>> print(sequence_file)
SequenceFile(id="example" nseq=2)

ascii_matrix

The alignment encoded in a 2-D ASCII matrix
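To illustrate what a 2-D ASCII matrix of an alignment could look like, the sketch below encodes every aligned character as its ASCII code point using the built-in `ord`. The helper name `to_ascii_matrix` is hypothetical and is not part of ConKit; the actual property may use a different container type.

```python
def to_ascii_matrix(sequences):
    """Encode equal-length sequences as rows of ASCII code points (hypothetical helper)."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("Sequences must all have the same length")
    # One row per sequence, one column per alignment position
    return [[ord(ch) for ch in s] for s in sequences]


matrix = to_ascii_matrix(["AC-", "ACD"])
print(matrix)  # [[65, 67, 45], [65, 67, 68]]
```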

calculate_freq(*args, **kwargs)
calculate_meff(*args, **kwargs)
calculate_neff_with_identity(*args, **kwargs)
calculate_weights(*args, **kwargs)
diversity

The diversity of an alignment defined by $$\sqrt{N}/L$$.

N equals the number of sequences in the alignment and L the sequence length
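The diversity formula can be checked by hand with a few lines of plain Python. This is a minimal sketch that works on a list of aligned strings rather than on a SequenceFile; the function name `alignment_diversity` is an assumption made for illustration.

```python
import math


def alignment_diversity(sequences):
    """Return sqrt(N) / L for N aligned sequences of length L (illustrative helper)."""
    n = len(sequences)          # N: number of sequences
    length = len(sequences[0])  # L: sequence (alignment) length
    return math.sqrt(n) / length


alignment = ["ACDE", "AC-E", "ACDF", "GCDE"]
print(alignment_diversity(alignment))  # sqrt(4) / 4 = 0.5
```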

empty

Status of emptiness of sequencefile

encoded_matrix

The alignment encoded for contact prediction

filter(min_id=0.3, max_id=0.9, inplace=False)

Filter sequences from an alignment according to the minimum and maximum identity between the sequences

Parameters:
    min_id (float, optional) – Minimum sequence identity
    max_id (float, optional) – Maximum sequence identity
    inplace (bool, optional) – Replace the saved order of sequences [default: False]
Returns: The reference to the SequenceFile, regardless of inplace
Return type: SequenceFile
Raises:
    ValueError – SequenceFile is not an alignment
    ValueError – Minimum sequence identity needs to be between 0 and 1
    ValueError – Maximum sequence identity needs to be between 0 and 1
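The idea of identity-based filtering can be sketched without ConKit. The example below uses a greedy reading of the description: a sequence is kept only if its identity to every previously kept sequence lies within the given bounds. This reading, and the helper names `identity` and `filter_by_identity`, are assumptions for illustration; the library's own selection rule may differ.

```python
def identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_by_identity(sequences, min_id=0.3, max_id=0.9):
    """Greedy sketch: keep a sequence if its identity to all kept ones is in range."""
    for bound in (min_id, max_id):
        if not 0 <= bound <= 1:
            raise ValueError("Sequence identity needs to be between 0 and 1")
    kept = []
    for seq in sequences:
        # all() is True for an empty list, so the first sequence is always kept
        if all(min_id <= identity(seq, other) <= max_id for other in kept):
            kept.append(seq)
    return kept


alignment = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIMM", "ACDEWWWWWW", "WWWWWWWWWW"]
filtered = filter_by_identity(alignment)
print(filtered)  # ['ACDEFGHIKL', 'ACDEFGHIMM', 'ACDEWWWWWW']
```

In this sketch the exact duplicate is dropped for exceeding the maximum identity, and the all-W sequence is dropped for falling below the minimum.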
filter_gapped(min_prop=0.0, max_prop=0.9, inplace=True)

Filter all sequences with a gap proportion greater than the limit

Parameters:
    min_prop (float, optional) – Minimum allowed gap proportion [default: 0.0]
    max_prop (float, optional) – Maximum allowed gap proportion [default: 0.9]
    inplace (bool, optional) – Replace the saved order of sequences [default: True]
Returns: The reference to the SequenceFile, regardless of inplace
Return type: SequenceFile
Raises:
    ValueError – SequenceFile is not an alignment
    ValueError – Minimum gap proportion needs to be between 0 and 1
    ValueError – Maximum gap proportion needs to be between 0 and 1
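A gap-proportion filter is straightforward to sketch on plain strings. The example below assumes that gaps are written as `-` and that both bounds are inclusive; the names `gap_proportion` and `drop_gapped` are hypothetical and only mirror the parameters documented above.

```python
def gap_proportion(seq, gap="-"):
    """Fraction of positions in a sequence that are gap characters."""
    return seq.count(gap) / len(seq)


def drop_gapped(sequences, min_prop=0.0, max_prop=0.9):
    """Keep sequences whose gap proportion lies within [min_prop, max_prop]."""
    for bound in (min_prop, max_prop):
        if not 0 <= bound <= 1:
            raise ValueError("Gap proportion needs to be between 0 and 1")
    return [s for s in sequences if min_prop <= gap_proportion(s) <= max_prop]


alignment = ["ACDEFGHIKL", "A---------", "----------", "ACD--GHIKL"]
kept = drop_gapped(alignment, max_prop=0.5)
print(kept)  # ['ACDEFGHIKL', 'ACD--GHIKL']
```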
get_frequency(symbol)

Calculate the frequency of an amino acid (symbol) in each Multiple Sequence Alignment column

Returns: A list containing the per alignment-column amino acid frequency count
Return type: list
Raises:
    RuntimeError – SequenceFile is not an alignment
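Per-column counting can be illustrated by transposing the alignment with `zip`. This is a sketch on plain strings, following the "frequency count" wording above by returning raw counts; the function name `column_counts` is an assumption, not a ConKit function.

```python
def column_counts(sequences, symbol):
    """Count how often `symbol` occurs in each alignment column."""
    # zip(*sequences) yields one tuple per column of the alignment
    return [column.count(symbol) for column in zip(*sequences)]


alignment = ["AC-E", "ACDE", "A-DE"]
print(column_counts(alignment, "-"))  # [0, 1, 1, 0]
print(column_counts(alignment, "A"))  # [3, 0, 0, 0]
```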
get_meff_with_id(identity)

Calculate the number of effective sequences with specified sequence identity

get_weights(identity=0.8)

Calculate the sequence weights

This function calculates the sequence weights in the Multiple Sequence Alignment.

The mathematical function used to calculate Meff is

$M_{eff}=\sum_{i}\frac{1}{\sum_{j}S_{i,j}}$
Parameters:
    identity (float, optional) – The sequence identity to use for similarity decision [default: 0.8]
Returns: A list of the sequence weights in the alignment
Return type: list
Raises:
    ValueError – SequenceFile is not an alignment
    ValueError – Sequence Identity needs to be between 0 and 1
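The Meff formula can be worked through in plain Python. The sketch below assumes that $S_{i,j}$ is 1 when sequences i and j are at least `threshold` identical (including j = i) and 0 otherwise, so that each weight is the reciprocal of a sequence's neighbour count. That definition of $S_{i,j}$, and the helper names, are assumptions for illustration and may not match the library's implementation.

```python
def identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def sequence_weights(sequences, threshold=0.8):
    """Weight each sequence by 1 / (number of sequences similar to it)."""
    if not 0 <= threshold <= 1:
        raise ValueError("Sequence identity needs to be between 0 and 1")
    weights = []
    for a in sequences:
        # Assumed S_ij: 1 if identity >= threshold, else 0 (a sequence counts itself)
        neighbours = sum(identity(a, b) >= threshold for b in sequences)
        weights.append(1.0 / neighbours)
    return weights


alignment = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKM", "WWWWWWWWWW"]
weights = sequence_weights(alignment)
meff = sum(weights)  # three mutually similar sequences plus one outlier: about 2.0
print(weights, meff)
```

The three near-identical sequences share one unit of weight between them, while the outlier contributes a full unit on its own.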
is_alignment

A boolean status for the alignment

Returns: A boolean status for the alignment
Return type: bool
meff

The number of effective sequences

neff
nseq

The number of sequences

remark

The SequenceFile-specific remarks

sort(kword, reverse=False, inplace=False)

Sort the SequenceFile

Parameters:
    kword (str) – The dictionary key to sort sequences by
    reverse (bool, optional) – Sort the sequences in reverse order [default: False]
    inplace (bool, optional) – Replace the saved order of sequences [default: False]
Returns: The reference to the SequenceFile, regardless of inplace
Return type: SequenceFile
Raises:
    ValueError – kword not in SequenceFile
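Sorting entries by a named key can be sketched with `operator.attrgetter`. The `Record` class below is a hypothetical stand-in for a sequence entry, and `sort_records` is an illustrative helper that returns a new list, roughly analogous to calling sort with inplace=False.

```python
from operator import attrgetter


class Record:
    """Hypothetical stand-in for a sequence entry with an id and a seq attribute."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


def sort_records(records, kword, reverse=False):
    """Return a new list of records ordered by the attribute named `kword`."""
    if not all(hasattr(r, kword) for r in records):
        raise ValueError("kword not in records")
    return sorted(records, key=attrgetter(kword), reverse=reverse)


records = [Record("foo", "ZYX"), Record("bar", "ABC")]
print([r.id for r in sort_records(records, "id")])                # ['bar', 'foo']
print([r.id for r in sort_records(records, "id", reverse=True)])  # ['foo', 'bar']
```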
status

An indication of the sequence file status, i.e. alignment, no alignment, or unknown

summary()

Generate a summary for the SequenceFile

Returns: str
to_string()

Return the SequenceFile as str

top_sequence

The first Sequence entry in SequenceFile

Returns: The first Sequence entry in SequenceFile
Return type: Sequence
trim(start, end, inplace=False)

Trim the SequenceFile

Parameters:
    start (int) – First residue to include
    end (int) – Final residue to include
    inplace (bool, optional) – Replace the saved order of sequences [default: False]
Returns: The reference to the SequenceFile, regardless of inplace
Return type: SequenceFile
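Trimming with an inclusive first and final residue maps onto a Python slice once the numbering convention is fixed. The sketch below assumes 1-based residue numbering, which is an assumption and is not stated in the text above; `trim_sequences` is a hypothetical helper operating on plain strings.

```python
def trim_sequences(sequences, start, end):
    """Keep residues start..end inclusive, assuming 1-based residue numbering."""
    if not 1 <= start <= end:
        raise ValueError("Expected 1 <= start <= end")
    # Convert the inclusive 1-based range into a 0-based half-open slice
    return [s[start - 1:end] for s in sequences]


alignment = ["ACDEFG", "AC-EFG"]
print(trim_sequences(alignment, 2, 4))  # ['CDE', 'C-E']
```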