
StrucAlign

Short project - M2 Bioinformatics - Université de Paris

Subject: Design of a threading program using double dynamic programming

Objective: Implement a program based on the method described in reference 3, which relies on double dynamic programming; for more information, see reference 4. Threading (see references 1, 2 and 3) is a strategy for finding sequences that are compatible with a given structure. Only the α-carbons of the protein are considered. The DOPE statistical potentials (data/dope.par) are used.
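To make the idea of double dynamic programming more concrete, here is a minimal sketch and not the code in src/. For every pair (sequence residue i, template position j) it assumes that i sits on j, scores the rest of the sequence against the rest of the template with a low-level Needleman-Wunsch pass, and stores the result in a high-level matrix, which is then aligned once more. The names `nw_score`, `high_level_matrix`, `thread_score` and `toy_energy` are hypothetical, the energy function is a toy stand-in for a DOPE lookup, and storing only the best low-level score per cell is a simplification of the published method.

```python
from typing import Callable, Iterable, List, Sequence

Energy = Callable[[str, str, float], float]


def nw_score(pair_score: Callable[[int, int], float],
             rows: Iterable[int], cols: Iterable[int], gap: float) -> float:
    """Best global alignment score of `rows` against `cols` (score only)."""
    rows, cols = list(rows), list(cols)
    # prev[b]: best score for aligning zero rows with the first b columns
    prev = [b * gap for b in range(len(cols) + 1)]
    for a, k in enumerate(rows, start=1):
        cur = [a * gap] + [0.0] * len(cols)
        for b, l in enumerate(cols, start=1):
            cur[b] = max(prev[b - 1] + pair_score(k, l),  # k placed on l
                         prev[b] + gap,                   # gap in template
                         cur[b - 1] + gap)                # gap in sequence
        prev = cur
    return prev[-1]


def high_level_matrix(seq: str, dist: Sequence[Sequence[float]],
                      energy: Energy, gap: float) -> List[List[float]]:
    """Low-level pass: one constrained alignment per (i, j) pair."""
    n, m = len(seq), len(dist)
    high = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Lower energy is better, so the score is the negated energy
            # of residue k at position l, given residue i fixed at j.
            def pair_score(k: int, l: int, i: int = i, j: int = j) -> float:
                return -energy(seq[i], seq[k], dist[j][l])

            # Force the path through (i, j): align what comes before
            # and what comes after independently.
            before = nw_score(pair_score, range(i), range(j), gap)
            after = nw_score(pair_score, range(i + 1, n), range(j + 1, m), gap)
            high[i][j] = before + after
    return high


def thread_score(seq: str, dist: Sequence[Sequence[float]],
                 energy: Energy, gap: float) -> float:
    """High-level pass: align once more on the accumulated matrix."""
    high = high_level_matrix(seq, dist, energy, gap)
    return nw_score(lambda k, l: high[k][l],
                    range(len(seq)), range(len(dist)), gap)


# Toy input (assumption): constant favourable energy, two Cα positions.
def toy_energy(res_a: str, res_b: str, distance: float) -> float:
    return -1.0


demo_dist = [[0.0, 3.8], [3.8, 0.0]]
demo_score = thread_score("AC", demo_dist, toy_energy, gap=-1.0)
```

The gap value is negative because scores are maximised, which matches the sign convention of the -g option described under Usage.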

References:

  1. Jones, D.T., Taylor, W.R. & Thornton, J.M. (1992) A new approach to protein fold recognition. Nature. 358, 86-89.
  2. Jones, D.T., Miller, R.T. & Thornton, J.M. (1995) Successful protein fold recognition by optimal sequence threading validated by rigorous blind testing. Proteins. 23, 387-397.
  3. Jones, D.T. (1998) THREADER : Protein Sequence Threading by Double Dynamic Programming. (in) Computational Methods in Molecular Biology. Steven Salzberg, David Searls, and Simon Kasif, Eds. Elsevier Science. Chapter 13.
  4. Protein Structure Comparison Using SAP - Springer

Data set

Installation

Requirements

Install miniconda and git.

StrucAlign environment

Clone StrucAlign repository.

% git clone https://git.clamifa.net/Naha/strucalign.git

Move into your local repository and create the conda environment from the struc_align_cgn.yml file.

% conda env create -f struc_align_cgn.yml

Then activate it.

% conda activate struc_align_cgn

Usage

StrucAlign needs three types of input files:

  • par: energy scores between pairs of amino acids as a function of distance
  • pdb: Protein Data Bank file with all atoms and their three-dimensional positions
  • fasta: amino acid sequence
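Since the project only considers α-carbons, the useful part of a pdb file is small. The sketch below is an illustration and not the code of script/parsepdb.py: it reads the fixed-width ATOM records of the PDB format, keeps the CA atoms, and builds a pairwise distance matrix. The function names `read_ca_atoms` and `distance_matrix` and the three sample lines are made up for the example.

```python
import math
from typing import Iterable, List, Tuple

# (residue name, residue number, (x, y, z))
CaAtom = Tuple[str, int, Tuple[float, float, float]]


def read_ca_atoms(pdb_lines: Iterable[str]) -> List[CaAtom]:
    """Keep only the Cα atoms of ATOM records (fixed-width PDB columns)."""
    atoms = []
    for line in pdb_lines:
        # Columns 13-16 hold the atom name, 18-20 the residue name,
        # 23-26 the residue number, 31-54 the x, y, z coordinates.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((line[17:20], int(line[22:26]), coords))
    return atoms


def distance_matrix(atoms: List[CaAtom]) -> List[List[float]]:
    """Euclidean distance between every pair of Cα atoms."""
    return [[math.dist(a[2], b[2]) for b in atoms] for a in atoms]


# Made-up sample records, only for illustration.
sample_lines = [
    "ATOM      1  CA  MET A   1       0.000   0.000   0.000",
    "ATOM      2  N   GLY A   2       1.000   1.000   1.000",
    "ATOM      3  CA  GLY A   2       3.000   4.000   0.000",
]
sample_atoms = read_ca_atoms(sample_lines)
sample_dist = distance_matrix(sample_atoms)
```

`math.dist` requires Python 3.8 or later, which is satisfied by the Python version listed under Versions.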

By default, these files are expected in the data/ directory, but you can also pass files located anywhere on the system.

This project ships with a data set and some already parsed files in the result/ directory.

Parsing dope and pdb files

The script/ directory contains two Python scripts:

  • parsedope.py
  • parsepdb.py

Use these scripts to parse your input DOPE or PDB files.

Run each script with two arguments: -i (input file) and -o (output directory).

Examples:

% python script/parsedope.py -i data/dope.par -o result/
% python script/parsepdb.py -i data/1a6k.pdb -o result/

ℹ️ Info

The output file of the DOPE parsing script is named parsedope.par.

The output file of the PDB parsing script is named according to the temp_[pdb].pdb format, where [pdb] is the name of the input protein (pdb) file from which the template is created.

Structure alignment

To compute the structure alignment for a given amino acid sequence and template structure, run the structure_align.py Python script.

⚠️ Warning!

Run the script only from the project's root directory:

% python src/structure_align.py

This script accepts 7 optional arguments. If an argument is not specified, the script uses its default value.

| Option | Description | Default value |
|--------|-------------|---------------|
| -t | Input template pdb file | result/temp_1ard_1.pdb |
| -d | Input parsed par file (DOPE) | result/parsedope.par |
| -s | Input fasta file (sequence) | data/1znf_1.fasta |
| -g | Gap penalty (value must be negative) | 0 |
| -p | Enable parallel processing | True |
| -c | Number of CPU cores to use | all cores |
| -o | Output path of the generated alignment file | result/ |
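The options above could be declared with `argparse` roughly as follows. This is a sketch under assumptions and not the parser used in src/structure_align.py: the `dest` names, the `str_to_bool` helper, and the choice of `float` for the gap value are all hypothetical. Only the flags and default values come from the table.

```python
import argparse
import os


def str_to_bool(value: str) -> bool:
    """Turn 'True'/'False' typed on the command line into a bool."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Thread a sequence onto a template")
    parser.add_argument("-t", dest="template", default="result/temp_1ard_1.pdb",
                        help="input template pdb file")
    parser.add_argument("-d", dest="dope", default="result/parsedope.par",
                        help="input parsed par file (DOPE)")
    parser.add_argument("-s", dest="sequence", default="data/1znf_1.fasta",
                        help="input fasta file")
    # Assumption: the gap penalty is parsed as a float.
    parser.add_argument("-g", dest="gap", type=float, default=0.0,
                        help="gap penalty")
    parser.add_argument("-p", dest="parallel", type=str_to_bool, default=True,
                        help="enable parallel processing")
    parser.add_argument("-c", dest="cores", type=int, default=os.cpu_count(),
                        help="number of CPU cores to use")
    parser.add_argument("-o", dest="output", default="result/",
                        help="output directory of the alignment file")
    return parser


# Equivalent of: python src/structure_align.py -p False -c 1 -g -2
demo_args = build_parser().parse_args(["-p", "False", "-c", "1", "-g", "-2"])
```

A plain `type=bool` would not work for -p, because any non-empty string (including "False") is truthy in Python, hence the small conversion helper.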

ℹ️ Info

  • For option -c (only used when -p is True):
    • if 0: an error is raised
    • if greater than the number of CPU cores on the machine: the maximum available is used
  • By default, all output files are named according to the align_[fasta]_[template].txt format.
    • [fasta] is the name of the input sequence (fasta) file
    • [template] is the name of the input template (pdb) file
  • If the script is run several times with the same structure and sequence, the output file is overwritten.
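The rules above can be sketched as two small helpers. This is an illustration, not the project's code: the names `resolve_cores` and `alignment_path` are hypothetical, rejecting negative values is an assumption (the rules only mention 0), and taking the file name without its extension for [fasta] and [template] is also an assumption.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_cores(requested: int, available: Optional[int] = None) -> int:
    """Apply the -c rules: 0 is an error, too large is capped."""
    if available is None:
        available = os.cpu_count() or 1
    # Assumption: negative values are rejected as well as 0.
    if requested < 1:
        raise ValueError("the number of CPU cores must be at least 1")
    return min(requested, available)


def alignment_path(fasta: str, template: str, out_dir: str = "result/") -> Path:
    """Build align_[fasta]_[template].txt inside the output directory."""
    # Assumption: [fasta] and [template] are the file names without extension.
    name = f"align_{Path(fasta).stem}_{Path(template).stem}.txt"
    return Path(out_dir) / name


demo_cores = resolve_cores(64, available=8)
demo_path = alignment_path("data/1znf_1.fasta", "result/temp_1ard_1.pdb")
```

Because the file name depends only on the two inputs, running the same pair twice produces the same path, which is consistent with the overwrite behaviour noted above.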

Examples:

% # Run without arguments
% python src/structure_align.py
% # Run with only 1 CPU core and parallelization disabled
% python src/structure_align.py -p False -c 1

Tests

In order to use test features, go read the

ℹ️ Info

To launch JupyterLab, first activate the environment, then run:

% jupyter lab

Versions

  • conda version : 4.10.3
  • python version : 3.9.5.final.0 [GCC 7.5.0] :: Anaconda, Inc. on linux
  • pandas : 1.3.2
  • scipy : 1.6.2
  • jupyter lab : 3.1.7
  • joblib : 1.0.1

StrucAlign is licensed under MIT License. You can find the complete text in LICENSE.