cathpy.align¶
from cathpy.align import Align
aln = Align.new_from_stockholm('/path/to/align.sto')
aln.count_sequences
# 75
seq = aln.find_seq_by_accession('1cukA01')
seq = aln.find_seq_by_id('1cukA01/1-151')
cathpy.align - manipulating protein sequences and alignments
-
class
cathpy.align.
Align
(seqs=None, *, _id=None, accession=None, description=None, aln_type=None, min_bitscore=None, cath_version=None, dops_score=None)¶ Object representing a protein sequence alignment.
-
add_groupsim
()¶ Add groupsim annotation to this alignment.
-
add_scorecons
()¶ Add scorecons annotation to this alignment.
-
add_sequence
(seq: cathpy.align.Sequence)¶ Add a sequence to this alignment.
-
aln_positions
¶ Returns the number of alignment positions.
-
copy
()¶ Return a deepcopy of this object.
-
count_sequences
¶ Returns the number of sequences in the alignment.
-
find_first_seq_by_accession
(acc)¶ Returns the first Sequence with the given accession.
-
find_seq_by_accession
(acc)¶ Returns the Sequence corresponding to the provided accession.
-
find_seq_by_id
(_id)¶ Returns the Sequence corresponding to the provided id.
-
get_seq_at_offset
(offset)¶ Returns the Sequence at the given offset (zero-based).
-
id
¶ Returns the id of this Align object.
-
insert_gap_at_offset
(offset, gap_char='-')¶ Insert a gap char at the given offset (zero-based).
-
lower_case_at_offset
(start, end=None)¶ Lower case all the residues in the given alignment window.
-
merge_alignment
(merge_aln, ref_seq_acc: str, ref_correspondence: cathpy.align.Correspondence = None, *, cluster_label=None, merge_ref_id=False, self_ref_id=False)¶ Merges aligned sequences into the current object via a reference sequence.
Sequences in merge_aln are brought into the current alignment using the equivalences identified in the reference sequence ref_seq_acc (which must exist in both self and merge_aln).
This function was originally written to merge FunFam alignments according to structural equivalences identified by CORA (a multiple structural alignment tool). Moving between structure and sequence adds the complication that sequences in the structural alignment (CORA) are based on ATOM records, whereas sequences in the merge alignment (FunFams) are based on SEQRES records. The ref_correspondence argument allows this mapping to be taken into account.
Parameters:
- merge_aln (Align) – An Align containing the reference sequence and any additional sequences to merge.
- ref_seq_acc (str) – The accession used to find the reference sequence in both the current alignment and merge_aln.
- ref_correspondence (Correspondence) – An optional Correspondence object that provides a mapping between the reference sequence found in self (ATOM records) and the reference sequence as it appears in merge_aln (SEQRES records).
- cluster_label (str) – A label to differentiate the sequences being merged (eg for groupsim calculations). A default label is provided if this is None.
- self_ref_id (str) – The id to use when adding the ref sequence from the current alignment.
- merge_ref_id (str) – The id to use when adding the ref sequence from the merge alignment. By default this sequence is only included in the final alignment (as <id>_merge) if a custom correspondence is provided.
Returns: Array of Sequences added to the current alignment.
Return type: [Sequence]
Raises: MergeCorrespondenceError – problem mapping reference sequence between alignment and correspondence
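The idea of merging "via a reference sequence" can be illustrated with a small standalone sketch. This is plain Python and not cathpy's implementation: the helper names `ungapped_offsets` and `column_equivalences` are hypothetical, the set of gap characters is an assumption, and the sketch assumes the reference sequence has identical residues in both alignments (so no Correspondence is needed).

```python
GAP_CHARS = "-."  # assumption: characters treated as gaps in this sketch


def ungapped_offsets(aligned_seq):
    """Return the zero-based alignment offset of every non-gap residue."""
    return [i for i, ch in enumerate(aligned_seq) if ch not in GAP_CHARS]


def column_equivalences(ref_in_self, ref_in_merge):
    """Pair up columns of two alignments through a shared reference sequence.

    Returns one (offset_in_self, offset_in_merge) tuple per residue of the
    reference sequence.
    """
    self_offsets = ungapped_offsets(ref_in_self)
    merge_offsets = ungapped_offsets(ref_in_merge)
    if len(self_offsets) != len(merge_offsets):
        raise ValueError("reference sequence differs between the two alignments")
    return list(zip(self_offsets, merge_offsets))


# The same reference sequence (AKGH) as it appears in two different alignments
pairs = column_equivalences("AK-GH", "A-KG-H")
print(pairs)  # [(0, 0), (1, 2), (3, 3), (4, 5)]
```

Each pair says which column of the merge alignment lines up with which column of the current alignment. When the two reference rows differ in length (ATOM versus SEQRES records), an extra mapping such as the ref_correspondence argument is needed first.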
-
classmethod
new_from_fasta
(fasta_io)¶ Initialises an alignment object from a FASTA file / string / io
-
classmethod
new_from_pir
(pir_io)¶ Initialises an alignment object from a PIR file / string / io
-
classmethod
new_from_stockholm
(sto_io, *, nowarnings=False)¶ Initialises an alignment object from a STOCKHOLM file / string / io
-
read_sequences_from_fasta
(fasta_io)¶ Parses aligned sequences from FASTA (str, file, io) and adds them to the current Align object. Returns the number of sequences that are added.
-
read_sequences_from_pir
(pir_io)¶ Parses aligned sequences from PIR (str, file, io) and adds them to the current Align object. Returns the number of sequences that are added.
-
remove_alignment_gaps
()¶ Return a new alignment after removing alignment positions that contain a gap for all sequences.
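A minimal sketch of the operation described here, written against plain strings rather than Align objects. The function name `remove_all_gap_columns` and the choice of gap characters are assumptions made for illustration; this is not the library's code.

```python
GAP_CHARS = "-."  # assumption: characters treated as gaps in this sketch


def remove_all_gap_columns(rows):
    """Drop every alignment column in which all sequences have a gap."""
    if not rows:
        return []
    # Keep a column if at least one row has a residue there
    keep = [
        col for col in range(len(rows[0]))
        if any(row[col] not in GAP_CHARS for row in rows)
    ]
    return ["".join(row[col] for col in keep) for row in rows]


rows = ["AK--GH", "A---GH", "AKP-G-"]
print(remove_all_gap_columns(rows))  # ['AK-GH', 'A--GH', 'AKPG-']
```

Only the fourth column is removed, because it is the only one that is a gap in every row. The input list is left unchanged, which mirrors the "return a new alignment" behaviour described above.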
-
remove_sequence_by_id
(seq_id: str)¶ Removes a sequence from the alignment.
-
sequences
¶ Provides access to the Sequence objects in the alignment.
-
set_gap_char_at_offset
(offset, gap_char)¶ Override the gap char for all sequences at a given offset.
-
set_id
(_id)¶ Sets the id of this Align object.
-
slice_seqs
(start, end=None)¶ Return an array of Sequence objects from start to end.
-
to_fasta
(wrap_width=80)¶ Returns the alignment as a string (FASTA format)
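The effect of wrap_width can be shown with a small standalone sketch. The helpers `wrap_sequence` and `format_fasta` are hypothetical names, and the exact layout produced by the library (for example trailing newlines) may differ.

```python
def wrap_sequence(seq, wrap_width=80):
    """Split a sequence string into chunks of at most wrap_width characters."""
    return [seq[i:i + wrap_width] for i in range(0, len(seq), wrap_width)]


def format_fasta(records, wrap_width=80):
    """Render (header, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(wrap_sequence(seq, wrap_width))
    return "\n".join(lines) + "\n"


fasta = format_fasta([("1cukA01/1-151", "AKGHPGPKAP--G")], wrap_width=5)
print(fasta, end="")
```

Plain slicing is used instead of `textwrap`, since `textwrap` treats hyphens as potential break points and gap characters would interfere with the wrapping.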
-
to_pir
(wrap_width=80)¶ Returns the alignment as a string (PIR format)
-
total_gap_positions
¶ Returns the total number of gaps in the alignment.
-
total_positions
¶ Returns the total number of positions in the alignment.
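As a rough illustration of how these two totals relate, the sketch below counts them over plain strings. It assumes that "total positions" means every cell in the alignment (residues plus gaps, summed over all sequences); the function name `gap_summary` and the gap characters are likewise assumptions.

```python
GAP_CHARS = "-."  # assumption: characters treated as gaps in this sketch


def gap_summary(rows):
    """Return (total_positions, total_gap_positions) for aligned rows."""
    total_positions = sum(len(row) for row in rows)
    total_gaps = sum(1 for row in rows for ch in row if ch in GAP_CHARS)
    return total_positions, total_gaps


total, gaps = gap_summary(["AK-GH", "A--GH", "AKPG-"])
print(total, gaps)  # 15 4
print(round(100 * gaps / total, 1))  # percentage of gap positions: 26.7
```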
-
write_fasta
(fasta_file, wrap_width=80)¶ Write the alignment to a file in FASTA format.
-
write_pir
(pir_file, wrap_width=80, *, use_accession=False)¶ Write the alignment to a file in PIR format.
-
write_sto
(sto_file, *, meta=None)¶ Write the alignment to a file in STOCKHOLM format.
-
-
class
cathpy.align.
Correspondence
(_id=None, residues=None, **kwargs)¶ Provides a mapping between ATOM and SEQRES residues.
A correspondence is a type of alignment that provides the equivalences between the residues in the protein sequence (eg SEQRES records) and the residues actually observed in the structure (eg ATOM records).
Within CATH, this is most commonly initialised from a GCF file:
` aln = Correspondence.new_from_gcf('/path/to/<id>.gcf') `
TODO: allow this to be created from PDBe API endpoint.
-
apply_seqres_segments
(segs)¶ Returns a new correspondence from just the residues within the segments.
-
atom_length
¶ Return the number of ATOM residues
-
atom_sequence
¶ Returns a Sequence corresponding to the ATOM records.
-
first_residue
¶ Returns the first residue in the correspondence.
-
get_res_at_offset
(offset: int)¶ Return the Residue at the given offset (zero-based)
-
get_res_by_atom_pos
(pos)¶ Returns Residue corresponding to position in the ATOM sequence (ignores gaps).
-
get_res_by_pdb_label
(pdb_label)¶ Returns the Residue that matches pdb_label
-
get_res_by_seq_num
(seq_num: int)¶ Return the Residue with the given sequence number
-
get_res_offset_by_atom_pos
(pos)¶ Returns offset of Residue at position in the ATOM sequence (ignores gaps).
-
last_residue
¶ Returns the last residue in the correspondence.
-
classmethod
new_from_gcf
(gcf_io)¶ Create a new Correspondence object from a GCF io / filename / string.
This provides a correspondence between SEQRES and ATOM records for a given protein structure.
Example format:
>gi|void|ref1
A 1 5 A
K 2 6 K
G 3 7 G
H 4 8 H
P 5 9 P
G 6 10 G
P 7 10A P
K 8 10B K
A 9 11 A
P 10 * *
G 11 * *
…
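The sketch below reads the visible rows of this example with plain Python, to show how the SEQRES and ATOM sequences relate. It is not the library's parser. The column meanings (SEQRES residue, sequence number, PDB residue label, ATOM residue) and the reading of `*` as "not observed in the structure" are assumptions inferred from the example, and `parse_gcf` is a hypothetical name.

```python
GCF_TEXT = """\
>gi|void|ref1
A 1 5 A
K 2 6 K
G 3 7 G
H 4 8 H
P 5 9 P
G 6 10 G
P 7 10A P
K 8 10B K
A 9 11 A
P 10 * *
G 11 * *
"""


def parse_gcf(text):
    """Parse GCF text into (header, rows).

    Each row is a dict with keys seqres_aa, seq_num, pdb_label and atom_aa.
    pdb_label and atom_aa are None where the file has '*'.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            continue
        seqres_aa, seq_num, pdb_label, atom_aa = line.split()
        rows.append({
            "seqres_aa": seqres_aa,
            "seq_num": int(seq_num),
            "pdb_label": None if pdb_label == "*" else pdb_label,
            "atom_aa": None if atom_aa == "*" else atom_aa,
        })
    return header, rows


header, rows = parse_gcf(GCF_TEXT)
seqres_seq = "".join(r["seqres_aa"] for r in rows)
atom_seq = "".join(r["atom_aa"] or "-" for r in rows)
print(header)      # gi|void|ref1
print(seqres_seq)  # AKGHPGPKAPG
print(atom_seq)    # AKGHPGPKA--
```

The two strings have the same length, with gaps in the ATOM row wherever a SEQRES residue has no observed counterpart. Note that PDB labels such as `10A` carry insert codes, so they are kept as strings rather than converted to integers.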
-
seqres_length
¶ Return the number of SEQRES residues
-
seqres_sequence
¶ Returns a Sequence corresponding to the SEQRES records.
-
set_id
(_id)¶ Sets the id of the current Correspondence object
-
to_aln
()¶ Returns the Correspondence as an Align object.
-
to_fasta
(**kwargs)¶ Returns the Correspondence as a string (FASTA format).
-
to_gcf
()¶ Renders the current object as a GCF string.
Example format:
>gi|void|ref1
A 1 5 A
K 2 6 K
G 3 7 G
H 4 8 H
P 5 9 P
G 6 10 G
P 7 10A P
K 8 10B K
A 9 11 A
P 10 * *
G 11 * *
…
-
to_sequences
()¶ Returns the Correspondence as a list of Sequence objects
-
-
class
cathpy.align.
Sequence
(hdr: str, seq: str, *, meta=None, description=None)¶ Class to represent a protein sequence.
-
accession_and_seginfo
¶ Returns accession and segment info for this Sequence.
-
cluster_id
¶ Returns the cluster id for this Sequence.
-
copy
()¶ Provide a deep copy of this sequence.
-
get_offset_at_seq_position
(seq_pos)¶ Return the offset (with gaps) of the given sequence position (ignores gaps).
-
get_res_at_offset
(offset)¶ Return the residue character at the given offset (includes gaps).
-
get_res_at_seq_position
(seq_pos)¶ Return the residue character at the given sequence position (ignores gaps).
-
get_residues
()¶ Returns an array of Residue objects based on this sequence.
Note: if segment information has been specified then this will be used to calculate the seq_num attribute.
Raises: OutOfBoundsError – problem mapping segment info to sequence
-
get_seq_position_at_offset
(offset)¶ Returns the sequence position (ignoring gaps) of the residue at the given alignment offset (which counts gaps).
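The difference between an offset (counts gaps) and a sequence position (ignores gaps) is easiest to see in a small standalone sketch. Offsets are zero-based, as stated elsewhere on this page; treating sequence positions as 1-based is an assumption here, as are the function names, the gap characters, and the choice to raise an error when the offset points at a gap.

```python
GAP_CHARS = "-."  # assumption: characters treated as gaps in this sketch


def offset_at_seq_position(aligned_seq, seq_pos):
    """Map a 1-based sequence position (gaps ignored) to a zero-based offset."""
    seen = 0
    for offset, ch in enumerate(aligned_seq):
        if ch not in GAP_CHARS:
            seen += 1
            if seen == seq_pos:
                return offset
    raise IndexError("sequence position %d is out of range" % seq_pos)


def seq_position_at_offset(aligned_seq, offset):
    """Map a zero-based offset to the 1-based position of the residue there."""
    if aligned_seq[offset] in GAP_CHARS:
        raise ValueError("offset %d points at a gap" % offset)
    return sum(1 for ch in aligned_seq[:offset + 1] if ch not in GAP_CHARS)


aligned = "A-K--GH"
print(offset_at_seq_position(aligned, 3))  # third residue (G) sits at offset 5
print(seq_position_at_offset(aligned, 5))  # and offset 5 maps back to position 3
```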
-
id
¶ Returns the id for this Sequence
-
insert_gap_at_offset
(offset, gap_char='-')¶ Insert a gap into the current sequence at a given offset.
-
is_cath_domain
¶ Returns whether this Sequence is a CATH domain.
-
static
is_gap
(res_char)¶ Test whether a character is considered a gap.
-
length
()¶ Return the length of the sequence.
-
lower_case_at_offset
(start, end=None)¶ Lower case the residues in the given sequence window.
-
seq
¶ Return the amino acid sequence as a string.
-
seq_no_gaps
¶ Return the amino acid sequence as a string (after removing all gaps).
-
set_all_gap_chars
(gap_char='-')¶ Sets all gap characters.
-
set_cluster_id
(id_str)¶ Sets the cluster id for this Sequence.
-
set_gap_char_at_offset
(offset, gap_char)¶ Set the gap character at the given offset.
If the residue at a given position is a gap, then override the gap char with the given character.
-
set_id
(_id)¶ Sets the id of the current Sequence object
-
set_lower_case_to_gap
(gap_char='-')¶ Set all lower-case characters to gap.
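A one-function sketch of this transformation on a plain string. The name `lower_case_to_gap` is hypothetical, and the sketch returns a new string rather than modifying an object in place.

```python
def lower_case_to_gap(seq, gap_char="-"):
    """Replace every lower-case character in seq with gap_char."""
    return "".join(gap_char if ch.islower() else ch for ch in seq)


print(lower_case_to_gap("AKghP-G"))       # AK--P-G
print(lower_case_to_gap("AKghP-G", "."))  # AK..P-G
```

Existing gap characters are left untouched, because `str.islower()` is false for characters such as `-` and `.`.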
-
slice_seq
(start, end=None)¶ Return a slice of this sequence.
-
classmethod
split_hdr
(hdr: str) → dict¶ Splits a sequence header into meta information.
Parameters: hdr (str) – header string (eg 'domain|4_2_0|1cukA01/3-23_56-123')
Returns: header info
{
  'id': 'domain|4_2_0|1cukA01/3-23_56-123',
  'accession': '1cukA01',
  'id_type': 'domain',
  'id_ver': '4_2_0',
  'segs': [Segment(3, 23), Segment(56, 123)],
  'meta': {}
}
Return type: info (dict)
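To make the header layout concrete, here is a standalone sketch that reproduces the dictionary shown above for that one example. It is not the library's parser: `split_header` is a hypothetical function, `Segment` is a stand-in namedtuple, and real headers may use variants that this sketch does not handle.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the Segment class used in the documented output
Segment = namedtuple("Segment", ["start", "stop"])


def split_header(hdr):
    """Split a 'type|version|accession/segs' style header into a dict."""
    id_type = id_ver = None
    body = hdr
    parts = hdr.split("|")
    if len(parts) == 3:
        id_type, id_ver, body = parts
    # Segment info (if any) follows the accession after a '/'
    accession, _, seg_str = body.partition("/")
    segs = []
    if seg_str:
        for seg in seg_str.split("_"):
            match = re.fullmatch(r"(-?\d+)-(-?\d+)", seg)
            if match is None:
                raise ValueError("cannot parse segment: " + seg)
            segs.append(Segment(int(match.group(1)), int(match.group(2))))
    return {
        "id": hdr,
        "accession": accession,
        "id_type": id_type,
        "id_ver": id_ver,
        "segs": segs,
        "meta": {},
    }


info = split_header("domain|4_2_0|1cukA01/3-23_56-123")
print(info["accession"], info["id_type"], info["id_ver"])
print(info["segs"])
```

Discontiguous domains are written as several `start-stop` ranges joined by underscores, which is why the segment string is split on `_` before each range is parsed.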
-
to_fasta
(wrap_width=80)¶ Return a string for this Sequence in FASTA format.
-
to_pir
(wrap_width=60, use_accession=False)¶ Return a string for this Sequence in PIR format.
-