This is the ReadMe file for the scripts in the supplementary material for the paper:

    Empirical potential function for simplified protein models: Combining contact and local sequence-structure descriptors
    Proteins: structure, function, and bioinformatics
    Jinfeng Zhang, Rong Chen and Jie Liang

Any report or publication of results obtained with any of these scripts should acknowledge their use by appropriate citation.

Step-by-step description of how to calculate dv (descriptor vector) files from a list of protein names.

Start from a file containing all the protein names and other information such as their length. For example, test.ls will look like:

    2erl0 40
    1fd3a 41
    1nkzb 41
    ...

First, get the pdb files for all the proteins:

    % ./getPdb.pl test.ls

This will make two new directories, one called wholePdb, the other called oneChain. The files in wholePdb are the original pdb files. The files in oneChain are the single-chain proteins we want.

Before going to directory oneChain, first copy the protein list and the perl scripts to that directory:

    % cp *.pl test.ls oneChain; cd oneChain

Now we have all the pdb files with only Ca and Cb coordinates (*.ab.pdb). To get the pdb files with Ca and the SC calculated from Ca and Cb, do

    % awk '{print "calSc.pl -if "$1}' test.ls | sh

We do it this way because calSc.pl takes only one protein name at a time. The number of files in this directory has now doubled: in addition to *.ab.pdb, we have *.sc.pdb, which will be used for calculating contacts.

Next we do

    % awk '{print "ContSum.pl -sc -if "$1}' test.ls | sh

This gives us the *.abc files, which contain the contact information. This step needs the file ContDis.txt from directory ~jinfeng/SharedData.

Next we do

    % awk '{print "alphator.pl "$1}' test.ls | sh

This produces *.ang files with alpha and tau angles for each residue except the first two residues and the last residue.
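
The awk one-liners used in these steps all follow the same pattern: awk prints one command line per protein name (the first column of the list), and sh then runs those lines. The sketch below makes that visible as a dry run and shows an equivalent shell loop. It is only an illustration: demo.ls is a made-up list file in the same format as test.ls, and print_cmds is a hypothetical helper, not one of the supplied scripts.

```shell
# Hypothetical list file in the same two-column format as test.ls.
printf '2erl0 40\n1fd3a 41\n1nkzb 41\n' > demo.ls

# What the awk one-liner hands to sh: one command line per protein.
# Without the trailing "| sh" the commands are printed, not executed.
awk '{print "calSc.pl -if "$1}' demo.ls

# Equivalent loop. print_cmds is a hypothetical helper (an assumption
# for this sketch); it only prints the commands, so it is a dry run.
print_cmds() {
    # $1 = command prefix, $2 = list file
    while read -r name rest; do
        printf '%s %s\n' "$1" "$name"
    done < "$2"
}

print_cmds "calSc.pl -if" demo.ls
```

Both forms print the same three lines (calSc.pl -if 2erl0, calSc.pl -if 1fd3a, calSc.pl -if 1nkzb). Printing the commands before piping them to sh is a simple way to check the list file and the script options.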
Next we do

    % awk '{print "../assign_disAngle -if "$1}' test.ls | sh

This generates *.sts files, which contain the discrete state information for each protein.

Next we do

    % awk '{print "getSeq.pl -if "$1}' test.ls | sh

This generates *.seq files, which contain the amino acid sequence for each protein. Some programs read the sequence.db file to get the sequence of a protein. In that case the sequence.db file also needs to be updated. The sequence.db file is just the concatenation of all the sequence files.

Now all the files needed to calculate the dv files are ready. Finally, we do

    % ../wdv -fl test.ls

where wdv is the executable compiled from one of the three C++ source files, writeDv455.cpp, writeDv565.cpp or writeDv610.cpp. You now have all dv files calculated.

Summary: File types and how to get them.

*.pdb *.ab.pdb:
    getPdb.pl list

*.sc.pdb:
    calSc.pl -if *.pdb
    awk '{print "calSc.pl -if "$1}' list | sh

*.sc.abc:
    ContSum.pl -sc -if *.pdb
    awk '{print "ContSum.pl -sc -if "$1}' list | sh

*.ang:
    alphator.pl *.pdb
    awk '{print "alphator.pl "$1}' list | sh

*.sts:
    assign_disAngle -if *.pdb
    awk '{print "assign_disAngle -if "$1}' list | sh

*.seq:
    getSeq.pl -if proteinName
    awk '{print "getSeq.pl -if "$1}' list | sh

*.dv:
    wdv568 -fl list

If you have questions, please feel free to ask me. I can be reached by email at jinfeng@bioinfo.stat.harvard.edu.

Jinfeng Zhang
Postdoc fellow
Computational Biology Lab
Statistics Department
Harvard University
Sept. 20, 2005