cathpy.align

from cathpy.align import Align

aln = Align.new_from_stockholm('/path/to/align.sto')

aln.count_sequences
# 75

seq = aln.find_seq_by_accession('1cukA01')
seq = aln.find_seq_by_id('1cukA01/1-151')

cathpy.align - manipulating protein sequences and alignments

class cathpy.align.Align(seqs=None, *, _id=None, accession=None, description=None, aln_type=None, min_bitscore=None, cath_version=None, dops_score=None)

Object representing a protein sequence alignment.

add_groupsim()

Add groupsim annotation to this alignment.

add_scorecons()

Add scorecons annotation to this alignment.

add_sequence(seq: cathpy.align.Sequence)

Add a sequence to this alignment.

aln_positions

Returns the number of alignment positions.

copy()

Return a deepcopy of this object.

count_sequences

Returns the number of sequences in the alignment.

find_first_seq_by_accession(acc)

Returns the first Sequence with the given accession.

find_seq_by_accession(acc)

Returns the Sequence corresponding to the provided accession.

find_seq_by_id(_id)

Returns the Sequence corresponding to the provided id.

get_seq_at_offset(offset)

Returns the Sequence at the given offset (zero-based).

id

Returns the id of this Align object.

insert_gap_at_offset(offset, gap_char='-')

Insert a gap char at the given offset (zero-based).

lower_case_at_offset(start, end=None)

Lower case all the residues in the given alignment window.

merge_alignment(merge_aln, ref_seq_acc: str, ref_correspondence: cathpy.align.Correspondence = None, *, cluster_label=None, merge_ref_id=False, self_ref_id=False)

Merges aligned sequences into the current object via a reference sequence.

Sequences in merge_aln are brought into the current alignment using the equivalences identified in the reference sequence ref_seq_acc (which must exist in both self and merge_aln).

This function was originally written to merge FunFam alignments according to structural equivalences identified by CORA (a multiple structural alignment tool). Moving between structure and sequence provides the added complication that sequences in the structural alignment (CORA) are based on ATOM records, whereas sequences in the merge alignment (FunFams) are based on SEQRES records. The ref_correspondence argument allows this mapping to be taken into account.

Parameters:
  • merge_aln (Align) – An Align containing the reference sequence and any additional sequences to merge.
  • ref_seq_acc (str) – The accession that will be used to find the reference sequence in the current alignment and merge_aln
  • ref_correspondence (Correspondence) – An optional Correspondence object that provides a mapping between the reference sequence found in self (ATOM records) and reference sequence as it appears in merge_aln (SEQRES records).
  • cluster_label (str) – Provide a label to differentiate the sequences being merged (eg for groupsim calculations). A default label is provided if this is None.
  • self_ref_id (str) – Specify the id to use when adding the ref sequence from the current alignment.
  • merge_ref_id (str) – Specify the id to use when adding the ref sequence from the merge alignment. By default this sequence is only included in the final alignment (as <id>_merge) if a custom correspondence is provided.
Returns:

Array of Sequences added to the current alignment.

Return type:

[Sequence]

Raises:

MergeCorrespondenceError – problem mapping reference sequence between alignment and correspondence
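
The general idea of merging through a shared reference sequence can be sketched in plain Python. This is a simplified illustration, not cathpy's implementation: the function name merge_via_reference and the dict-of-strings representation are hypothetical, '-' is assumed to be the only gap character, and the ATOM/SEQRES mapping handled by ref_correspondence is ignored (the reference must have the same ungapped sequence in both alignments).

```python
from typing import Dict

GAP = "-"  # assumption: a single gap character


def merge_via_reference(aln_a: Dict[str, str], aln_b: Dict[str, str],
                        ref_id: str) -> Dict[str, str]:
    """Merge the rows of aln_b into aln_a using the columns of a shared reference row."""
    ref_a, ref_b = aln_a[ref_id], aln_b[ref_id]
    if ref_a.replace(GAP, "") != ref_b.replace(GAP, ""):
        raise ValueError("reference sequence differs between the alignments")

    merged = {seq_id: [] for seq_id in aln_a}
    for seq_id in aln_b:
        if seq_id == ref_id:
            continue
        if seq_id in merged:
            raise ValueError(f"duplicate sequence id: {seq_id}")
        merged[seq_id] = []

    i = j = 0  # current column in aln_a / aln_b
    while i < len(ref_a) or j < len(ref_b):
        a_gap = i < len(ref_a) and ref_a[i] == GAP
        b_gap = j < len(ref_b) and ref_b[j] == GAP
        if a_gap:        # column exists only in aln_a
            take_a, take_b = True, False
        elif b_gap:      # column exists only in aln_b
            take_a, take_b = False, True
        else:            # both references hold the same residue here
            take_a, take_b = True, True
        for seq_id, row in aln_a.items():
            merged[seq_id].append(row[i] if take_a else GAP)
        for seq_id, row in aln_b.items():
            if seq_id != ref_id:
                merged[seq_id].append(row[j] if take_b else GAP)
        i += take_a
        j += take_b

    return {seq_id: "".join(chars) for seq_id, chars in merged.items()}


merged = merge_via_reference(
    {"ref": "AK-GH", "a1": "AKMGH"},
    {"ref": "A-KGH", "b1": "ATKG-"},
    "ref",
)
for seq_id, row in merged.items():
    print(seq_id, row)
```

Columns where only one reference row has a gap are kept as insertions, and the sequences from the other alignment are padded with gaps at that column.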

classmethod new_from_fasta(fasta_io)

Initialises an alignment object from a FASTA file / string / io

classmethod new_from_pir(pir_io)

Initialises an alignment object from a PIR file / string / io

classmethod new_from_stockholm(sto_io, *, nowarnings=False)

Initialises an alignment object from a STOCKHOLM file / string / io

read_sequences_from_fasta(fasta_io)

Parses aligned sequences from FASTA (str, file, io) and adds them to the current Align object. Returns the number of sequences that are added.

read_sequences_from_pir(pir_io)

Parses aligned sequences from PIR (str, file, io) and adds them to the current Align object. Returns the number of sequences that are added.

remove_alignment_gaps()

Return a new alignment after removing alignment positions that contain a gap for all sequences.
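
As a rough illustration of the operation described above, the following standalone sketch drops every column that holds a gap in all rows. It works on plain strings rather than Align objects; the helper name remove_all_gap_columns and the choice of '-' and '.' as gap characters are assumptions made for the example.

```python
from typing import List

GAP_CHARS = "-."  # assumption: both '-' and '.' count as gaps


def remove_all_gap_columns(rows: List[str]) -> List[str]:
    """Drop every column in which all rows hold a gap character."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    # Keep a column if at least one row has a residue there.
    keep = [col for col in range(width)
            if any(row[col] not in GAP_CHARS for row in rows)]
    return ["".join(row[col] for col in keep) for row in rows]


rows = ["AK-G-P", "A--G-P", "AKMG--"]
print(remove_all_gap_columns(rows))
# ['AK-GP', 'A--GP', 'AKMG-']
```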

remove_sequence_by_id(seq_id: str)

Removes a sequence from the alignment.

sequences

Provides access to the Sequence objects in the alignment.

set_gap_char_at_offset(offset, gap_char)

Override the gap char for all sequences at a given offset.

set_id(_id)

Sets the id of this Align object.

slice_seqs(start, end=None)

Return an array of Sequence objects from start to end.

to_fasta(wrap_width=80)

Returns the alignment as a string (FASTA format)
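
The effect of wrap_width can be illustrated with a small standalone formatter. This is a sketch only: format_fasta is a hypothetical helper that takes (id, sequence) pairs, and the exact header and line layout produced by cathpy may differ.

```python
from typing import Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str]], wrap_width: int = 80) -> str:
    """Render (id, sequence) pairs as FASTA text, wrapping sequence lines."""
    if wrap_width < 1:
        raise ValueError("wrap_width must be at least 1")
    lines = []
    for seq_id, seq in records:
        lines.append(f">{seq_id}")
        # Slice by hand: gap characters must never be treated as break points.
        lines.extend(seq[i:i + wrap_width] for i in range(0, len(seq), wrap_width))
    return "\n".join(lines) + "\n"


print(format_fasta([("1cukA01/1-151", "AK-GHPGPKA")], wrap_width=4), end="")
```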

to_pir(wrap_width=80)

Returns the alignment as a string (PIR format)

total_gap_positions

Returns the total number of gaps in the alignment.

total_positions

Returns the total number of positions in the alignment.
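
One plausible reading of these two properties is sketched below: the total number of positions is taken as the number of sequences multiplied by the alignment length, and the total number of gap positions as the count of gap characters over all rows. Both the interpretation and the helper name gap_stats are assumptions, not something this reference states.

```python
from typing import List, Tuple


def gap_stats(rows: List[str], gap_chars: str = "-.") -> Tuple[int, int]:
    """Return (total positions, total gap positions) over all aligned rows."""
    total_positions = sum(len(row) for row in rows)
    total_gaps = sum(1 for row in rows for ch in row if ch in gap_chars)
    return total_positions, total_gaps


positions, gaps = gap_stats(["AK-G", "A--G"])
print(positions, gaps, f"{gaps / positions:.1%} gaps")
```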

write_fasta(fasta_file, wrap_width=80)

Write the alignment to a file in FASTA format.

write_pir(pir_file, wrap_width=80, *, use_accession=False)

Write the alignment to a file in PIR format.

write_sto(sto_file, *, meta=None)

Write the alignment to a file in STOCKHOLM format.

class cathpy.align.Correspondence(_id=None, residues=None, **kwargs)

Provides a mapping between ATOM and SEQRES residues.

A correspondence is a type of alignment that provides the equivalences between the residues in the protein sequence (eg SEQRES records) and the residues actually observed in the structure (eg ATOM records).

Within CATH, this is most commonly initialised from a GCF file:

aln = Correspondence.new_from_gcf('/path/to/<id>.gcf')

TODO: allow this to be created from PDBe API endpoint.

apply_seqres_segments(segs)

Returns a new correspondence from just the residues within the segments.

atom_length

Return the number of ATOM residues

atom_sequence

Returns a Sequence corresponding to the ATOM records.

first_residue

Returns the first residue in the correspondence.

get_res_at_offset(offset: int)

Return the Residue at the given offset (zero-based)

get_res_by_atom_pos(pos)

Returns Residue corresponding to position in the ATOM sequence (ignores gaps).

get_res_by_pdb_label(pdb_label)

Returns the Residue that matches pdb_label

get_res_by_seq_num(seq_num: int)

Return the Residue with the given sequence number

get_res_offset_by_atom_pos(pos)

Returns offset of Residue at position in the ATOM sequence (ignores gaps).

last_residue

Returns the last residue in the correspondence.

classmethod new_from_gcf(gcf_io)

Create a new Correspondence object from a GCF io / filename / string.

This provides a correspondence between SEQRES and ATOM records for a given protein structure.

Example format:

>gi|void|ref1
A   1   5   A
K   2   6   K
G   3   7   G
H   4   8   H
P   5   9   P
G   6  10   G
P   7  10A  P
K   8  10B  K
A   9  11   A
P  10   *   *
G  11   *   *
…

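
The example format can be read with a few lines of plain Python. The sketch below assumes each record holds four whitespace-separated fields (SEQRES residue, sequence number, PDB residue label, ATOM residue) and that '*' marks a residue with no ATOM record; the column meanings are inferred from the example rather than from a format specification. The names Res, parse_gcf and res_by_atom_pos are hypothetical, and positions in res_by_atom_pos are assumed to be 1-based.

```python
from typing import List, NamedTuple, Optional, Tuple


class Res(NamedTuple):
    """Hypothetical record for one GCF line (not cathpy's Residue class)."""
    seqres_aa: str
    seq_num: int
    pdb_label: Optional[str]
    atom_aa: Optional[str]


def parse_gcf(text: str) -> Tuple[str, List[Res]]:
    """Parse GCF text into (header, residues)."""
    header = ""
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            continue
        seqres_aa, seq_num, pdb_label, atom_aa = line.split()
        residues.append(Res(
            seqres_aa,
            int(seq_num),
            None if pdb_label == "*" else pdb_label,  # '*' means not observed
            None if atom_aa == "*" else atom_aa,
        ))
    return header, residues


def res_by_atom_pos(residues: List[Res], pos: int) -> Res:
    """Return the pos-th residue (1-based) that has an ATOM record."""
    observed = [res for res in residues if res.atom_aa is not None]
    if not 1 <= pos <= len(observed):
        raise IndexError(f"ATOM position {pos} is out of range")
    return observed[pos - 1]


gcf_text = """>gi|void|ref1
A 1 5 A
K 2 6 K
P 7 10A P
P 10 * *
"""
header, residues = parse_gcf(gcf_text)
print(header, len(residues), res_by_atom_pos(residues, 3).pdb_label)
```
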
seqres_length

Return the number of SEQRES residues

seqres_sequence

Returns a Sequence corresponding to the SEQRES records.

set_id(_id)

Sets the id of the current Correspondence object

to_aln()

Returns the Correspondence as an Align object.

to_fasta(**kwargs)

Returns the Correspondence as a string (FASTA format).

to_gcf()

Renders the current object as a GCF string.

Example format:

>gi|void|ref1
A   1   5   A
K   2   6   K
G   3   7   G
H   4   8   H
P   5   9   P
G   6  10   G
P   7  10A  P
K   8  10B  K
A   9  11   A
P  10   *   *
G  11   *   *
…

to_sequences()

Returns the Correspondence as a list of Sequence objects

class cathpy.align.Sequence(hdr: str, seq: str, *, meta=None, description=None)

Class to represent a protein sequence.

accession_and_seginfo

Returns accession and segment info for this Sequence.

cluster_id

Returns the cluster id for this Sequence.

copy()

Provide a deep copy of this sequence.

get_offset_at_seq_position(seq_pos)

Return the offset (with gaps) of the given sequence position (ignores gaps).

get_res_at_offset(offset)

Return the residue character at the given offset (includes gaps).

get_res_at_seq_position(seq_pos)

Return the residue character at the given sequence position (ignores gaps).

get_residues()

Returns an array of Residue objects based on this sequence.

Note: if segment information has been specified then this will be used to calculate the seq_num attribute.

Raises:

OutOfBoundsError – problem mapping segment info to sequence

get_seq_position_at_offset(offset)

Returns the sequence position (ignoring gaps) of the residue at the given offset (which counts gaps).
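
The difference between an offset (which counts gap characters) and a sequence position (which does not) is easy to get wrong, so here is a standalone sketch of both conversions on a plain string. It assumes zero-based offsets, 1-based sequence positions, '-' and '.' as gap characters, and an error when the offset lands on a gap; these conventions and the function names are assumptions, not taken from cathpy.

```python
GAP_CHARS = "-."  # assumption: both '-' and '.' count as gaps


def offset_at_seq_position(aligned: str, seq_pos: int) -> int:
    """Zero-based offset of the seq_pos-th residue (1-based, gaps skipped)."""
    if seq_pos < 1:
        raise ValueError("sequence positions start at 1")
    seen = 0
    for offset, ch in enumerate(aligned):
        if ch not in GAP_CHARS:
            seen += 1
            if seen == seq_pos:
                return offset
    raise IndexError(f"sequence has fewer than {seq_pos} residues")


def seq_position_at_offset(aligned: str, offset: int) -> int:
    """1-based sequence position of the residue at a zero-based offset."""
    if not 0 <= offset < len(aligned):
        raise IndexError(f"offset {offset} is out of range")
    if aligned[offset] in GAP_CHARS:
        raise ValueError(f"offset {offset} points at a gap")
    # Count the residues up to and including this offset.
    return sum(1 for ch in aligned[:offset + 1] if ch not in GAP_CHARS)


aligned = "AK--GH-P"
print(offset_at_seq_position(aligned, 3))  # 4 (the 'G')
print(seq_position_at_offset(aligned, 5))  # 4 (the 'H')
```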

id

Returns the id for this Sequence

insert_gap_at_offset(offset, gap_char='-')

Insert a gap into the current sequence at a given offset.

is_cath_domain

Returns whether this Sequence is a CATH domain.

static is_gap(res_char)

Test whether a character is considered a gap.

length()

Return the length of the sequence.

lower_case_at_offset(start, end=None)

Lower case the residues in the given sequence window.

seq

Return the amino acid sequence as a string.

seq_no_gaps

Return the amino acid sequence as a string (after removing all gaps).

set_all_gap_chars(gap_char='-')

Sets all gap characters.

set_cluster_id(id_str)

Sets the cluster id for this Sequence.

set_gap_char_at_offset(offset, gap_char)

Set the gap character at the given offset.

If the residue at a given position is a gap, then override the gap char with the given character.

set_id(_id)

Sets the id of the current Sequence object

set_lower_case_to_gap(gap_char='-')

Set all lower-case characters to gap.
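
On a plain string, the operation amounts to the following one-liner. This is a sketch only; lower_case_to_gap is a hypothetical stand-in that returns a new string rather than modifying a Sequence in place.

```python
def lower_case_to_gap(seq: str, gap_char: str = "-") -> str:
    """Replace every lower-case character with gap_char."""
    return "".join(gap_char if ch.islower() else ch for ch in seq)


print(lower_case_to_gap("AKgh-Pa"))  # AK---P-
```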

slice_seq(start, end=None)

Return a slice of this sequence.

classmethod split_hdr(hdr: str) → dict

Splits a sequence header into meta information.

Parameters:
  • hdr (str) – header string (eg 'domain|4_2_0|1cukA01/3-23_56-123')
Returns:

header info

    {
        'id': 'domain|4_2_0|1cukA01/3-23_56-123',
        'accession': '1cukA01',
        'id_type': 'domain',
        'id_ver': '4_2_0',
        'segs': [Segment(3, 23), Segment(56, 123)],
        'meta': {}
    }

Return type:

info (dict)

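
To make the header layout concrete, here is a minimal standalone parser for the shape shown above. It is not cathpy's implementation: split_header is a hypothetical function that handles only the 'type|version|accession/segments' and 'accession/segments' forms, returns segments as (start, stop) tuples instead of Segment objects, and always leaves 'meta' empty.

```python
import re
from typing import Any, Dict


def split_header(hdr: str) -> Dict[str, Any]:
    """Split a header such as 'domain|4_2_0|1cukA01/3-23_56-123' into parts."""
    info: Dict[str, Any] = {
        "id": hdr, "accession": None, "id_type": None,
        "id_ver": None, "segs": [], "meta": {},
    }
    name = hdr
    parts = hdr.split("|")
    if len(parts) == 3:
        # e.g. 'domain', '4_2_0', '1cukA01/3-23_56-123'
        info["id_type"], info["id_ver"], name = parts
    accession, _, seg_str = name.partition("/")
    info["accession"] = accession
    if seg_str:
        # Segments are separated by '_' and written as 'start-stop'.
        for seg in seg_str.split("_"):
            match = re.fullmatch(r"(-?\d+)-(-?\d+)", seg)
            if match is None:
                raise ValueError(f"cannot parse segment: {seg!r}")
            info["segs"].append((int(match.group(1)), int(match.group(2))))
    return info


print(split_header("domain|4_2_0|1cukA01/3-23_56-123"))
```
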
to_fasta(wrap_width=80)

Return a string for this Sequence in FASTA format.

to_pir(wrap_width=60, use_accession=False)

Return a string for this Sequence in PIR format.