Search
Categories
Documents
CompBio - Core library for some basic methods useful in computational biology/bioinformatics. (Displayed) README
|
CompBio - Core library for some basic methods useful in computational biology/bioinformatics.
CompBio - Core library for some basic methods useful in computational biology/bioinformatics.
use CompBio;
my $cbc = new->CompBio;
$AR_faseqs = $cbc->tbl_to_fa(\@tblseqs);
or
use CompBio qw(tbl_to_fa);
$AR_faseqs = tbl_to_fa(\@tblseqs);
The CompBio module set is being developed as a new implementation of the code
base originally developed at the BioMolecular Engineering Research Center
(http://bmerc-www.bu.edu). CompBio.pm is intended to take a number
of small, commonly used methods for supporting bioinformatics research, and
make them into a single package. Many of the utils included with this
distribution are just command line interfaces to the methods contained herein.
The CompBio module set is _not_ intended to replace the bioperl project
(http://www.bioperl.org/). Although I do welcome suggestions for improving
or adding to the methods available in these modules, particularly I would
love any help with things on the TO DO list, these modules are not intended
to provide the depth that the bioperl suite can provide.
CompBio has a limited API. It expects it's input to be in specific
formats, as described in each methods description, and it's output
is in a format that makes the most sense to me for that method. It does
no error checking by and large, so incorrect input could cause bizzare
behavior and/or a noisy death. If you want a flexible interface with lots
of error checking and deep levels of verbosity, use the CompBio::Simple manpage - that's
its job.
Latest version is available through SourceForge
http://sourceforge.net/projects/compbio/, or on the CPAN
http://www.cpan.org/authors/id/S/SE/SEANQ/CompBio-0.461.tar.gz.
Thanks!
Other modules available (or that will be available) in the CompBio set are:
The DB module will only be imediately useful if you import the databases as
used by us here at the BMERC(ftp://mcclintock.bu.edu/BMERC/mysql/) or develop
your own on the same basic design scheme. Otherwise I hope you find it useful
as a source of design ideas for rolling your own. Please note however that I
intend to expand the methods in CompBio/Simple.pm to allow seemless access to
data through the DB module. Although at no time will including the DB module be
required for Simple to work, I think that if you have the space for it, you will
find having the data locally and adding this module (or adapting it to work with
databases already installed) will have an imense and imediate benifiacial
impact. It did for us at least! :)
The Profile module was designed to work with our PIMA-II sofware and the
PIMA modules. The PIMA suite is available for license from Boston University
for a nominal fee, and free for academic use. For examples and more info
see http://bmerc-www.bu.edu/PIMA/. Unless you have or are interested in that
package, this module will have no functional value.
You may note that the majority of the methods here are for converting
sequences from one format to another. Mainly this is for converting other
formats to table format, which is used by most of these programs. This is
not meant to be a comprhensive collection of format guessing and
transformation methods. If you are looking for a converter for a format
CompBio doesn't handle, I suggest you look into bioperls SeqIO package or
the READSEQ program (java), found at
http://iubio.bio.indiana.edu/soft/molbio/readseq/
Construct an object for invoking methods in CompBio.
Quits current application and uses perldoc to display the POD.
Checks a given sequence or set of sequences for it's type. Currently groks
fasta(.fa), table(.tbl), raw genome(.raw), intelligenics[?](.ig) and coding
dna sequence(.cdna) types. Each index of the referenced array should
be an entire sequence record. * It is however not recomended that you load
up an entire raw genome into memory to do this test - see perlfunc read *
$type = check_type(\@seqs,%parameters);
Possible return types are CDNA, TBL, FA, IG, RAW and UNKNOWN.
Be warned, this is intended only as a quick check and only uses as many
records as necisarry in the reference provided, stating with the first.
check_type assumes the rest of the records look the same and does not do any
kind of deep QA on the set. If you are not sure, invoke check_type with
a few random samples from your set, or use the CompBio::Simple manpage, which does
that by default.
Converts a sequence record in table (tab delimited, usually .tbl file extension)
format to fasta format.
$aref_faseqs = $cbc-tbl_to_fa(\@seqdat,%params);>
Each index in the @seqdat array must contain entire record (loci\tsequence) for
single sequence. Return is an array reference, still one sequence per index.
Extra data fields in the table format will be added to the annotation line (>)
in the fasta output.
Converts a sequence in table (tab delimited) format to .ig format. Accepts
sequences in a referenced array, one record per index.
$aref_igseqs = $cbc->tbl_to_ig(\@tbl_seqs,%params);
Extra data fields in the table format will be placed on a single comment
line (;) in the ig output.
Accepts fasta format sequences in a referenced array, one complete sequence
record per index. This method returns the sequence(s) in table(.tbl) format
contained in a referenced array.
$aref_faseqs = $cbc-fa_to_tbl(\@fa_seq);>
Extra data fields in the fasta format will be placed as extra tab delimited
values after the sequence record.
Options:
CLEAN: Reduces signifier to the first strech of non white space characters
found at least 4 characters long with no of the characters (| \ ? / ! *)
Accepts ig format sequences in a referenced array, one complete sequence
record per index. This method returns the sequence(s) in table(.tbl) format
contained in a referenced array.
$aref_igseqs = $cbc-ig_to_tbl(\@fa_seq);>
Extra comment lines in the ig format will be placed as extra tab delimited
values after the sequence record.
Converts a dna sequence, containing no whitespace, submited as a scalar reference,
to the amino acid residues as coded by the standard 'universal' genetic code.
Return is a reference to a scalar. dna sequence may contain standard special
characters (i.e. R,S,B,N, ect.).
$aa = dna_to_aa(\$dna_seq,%params);
Options:
C: Set to a true value to indicate dna should be converted to it's complement
before translation.
ALTCODE: A reference to a hash containing alternate coding keys where the value
is the new aa to code for. Stop codons are represented by ``.''. Multiple
allowable values for the final dna in the codon may be provided in the
format AT[TCA].
SEQFIX: Set to true to alter first position, making V or L an M, and
removing stop in last position.
Converts dna to it's complementary strand. DNA sequence is submitted as scalar
by reference. There is no return as sequence is modified. To maintain original
sequence, send a reference to a copy (or see documentation for
CompBio::Simple::complement).
complement(\$dna);
Converts a submitted dna sequence into all 6 frame translations to aa. Value
submitted may be a reference to a scalar containing the dna or a filename.
Either way dna must be in RAW(.raw) format (Not whitespace within sequence,
only one newline allowed at end of record). Output id's have start and stop
positions encoded; if first value is larger, translated from anti-sense.
$result = six_frame($raw_file,$id,$seq_len,$out_file);
Options:
ALTCODE: A reference to a hash containing alternate coding keys where the value
is the new aa to code for. Stop codons are represented by ``.''. Multiple
allowable values for the final dna in the codon may be provided in the
format AT[TCA].
ID: Prefix for translated segment identifiers, default is SixFrame.
SEQLEN: Minimum length of aa sequence to return, default is 10.
OUTPUT: A filename to pipe results to. This is recomended for large dna
sequences such as large contigs or whole chromosomes, otherwise results are
stored in memory until process complete. If the value of OUTFILE is STDOUT
results will be sent directly to standard out.
A simple interface to the Washington University distribution of blast. Only recently
tested with full 2.0MP-WashU release. I do however stick to using the compatability
features so it it should work with both the licensed and free releases of WUblast.
wu_blast requires two arguments, a sequence database filename or a reference to
an array containing the sequence(s) to search, and the query sequence filename or
a reference to an array containing the sequence to query with. Sequence files need
to be in fasta format and sequences contained in arrays need to be in table format.
For more versatility, use CompBio::Simple::blast. One thing wu_blast will do is
prepare the db for searching if it isn't set up in advance; this does require
you have write permision in the location of the base sequence file, and sufficient
room in your quota (if used).
Options:
METHOD: Alternate method to use for sequence comparison. Default is blastp.
PARAMS: Optional parameters to hand to the blast method. See the your local
documentation for the method chosen to see what options are available.
SERVER: Machine to use as the blast server. Defaults to the global $CPUSERVER.
rsh is currently used to launch the proccess.
BLASTPREP: Alternate method to use to prepare a sequence database for being
searched using the chosen blast method. Default is setdb.
$blast_out = blast($fa_file,$queryfile,(PARAMS => "-B=0 -matrix=BLOSUM62"));
Creates a hash table with amino acid lookups by codon. Includes all cases where
even an alternate na code (such as M for A or C) would return an unambiguous aa.
Also consistent with the complement method in this package, ie, lower cases in
some contexts, for ease of use with six_frame.
C<%aa_hash = aa_hash;>
None by default.
- 01
Original version; created by h2xs 1.20 with options
-AXC -n CompBio
- 20
Begin porting core function code for most initial methods from various
sources at BMERC. Write protocode and placeholders for what I want in
initial 'complete' version.
- 44
Almost everything eneded up being rewritten practically from scratch.
Removed lingering locale assumptions and (hopefully) improving
interface (added use of %params mainly).
Added OOP useability so this module should now work either way.
- 45
Modifications to Simple primarilly.
- 451
Fixed faulty statement in MANIFEST.SKIP that excluded Makefile.PL from
distribution (thanks to Andreas Riechert for pointing this out to me almost
imediately).
- 46
Added some documentation, mostly on options.
Added the ALTCODE option to six_frame.
Finished modifying tbl_to* converters to make a pass at handeling extra
data feilds from table format and changing *_to_tbl converters to keep
extra data besides id in extra fields in table format.
- 461
Started working on bringing the utility programs included into using
CompBio::Simple, using dna_to_aa as first trial. Also some small changes
to docs.
- 464
Major updates and alterations. Got most of the CompBio utils at least minimally
working. Added $RET_CODE usage to CompBio, setting old behavior as default, but
allowing Simple to provide as a parameter, so that return formating (such as
setting dna_to_aa to return in fasta) and file/STDOUT behavior is done during
process, reducing memory usage. Matching major alterations to Simple.
- 466
NOTE .464 was acidently never fully distributed as such. Sorry!
Project has been added to SourceForge and the modules list on CPAN.
Worked out some more bugs, mostly cropping up from interactions with utils
such as tbl_to_fa, which interface through Simple. Mainly found places where
lazy programming broke using the files tied to an array, so had to use less of
$_. Things are shaking out nicely, and I think all the core bits are now in
place. Once the utils are done, I'll do a major code cleanup and comment run
and step up to 0.5 and push release. Finishing adding converters and fixing any
reported bugs (new converters shouldn't add any new bugs, except inside the new
methods) will benchmark going to 0.6.
- 467
Minor fixes, some work on getting utils ready.
- 468
Testing utils and started working on getting them to state. Improved fa_to_tbl
basic genbank | parsing without using CLEAN, but still needs some work. I despair
ever resolving this to my satisfaction, but at least now almost all the stuff not
used is kept in the extra field allowed in the latest table file spec. .47 line
will be working on getting the utils all working and writing tests for them. The
test script might end up as long as the module at this rate!
- 469
More minor fixes to CompBio. Retruned to the version of Simple.pm from the .466
dist., as it was the last to complete all it's tests. I don't know how soon I'll
be able to resolve those problems. But the base version of the DB module is now
in, and overall I'm very pleased. It hasn't strayed far from the original BMERC
module as far as purpose of methods, but the code is a lot nicer. I've also
included the sql scripts for generating the base set of MySQL databases it's
designed to work with, and the .MRG files I use.
I like the basic design so far - except where large sequence sets are to be munged.
There this becomes a seriouse memory hog. Some faster & more efficient way
needs to be provided when dealing with files & large data sets.
complement and dna_to_aa need to be loop enabled, also alowing
&$RET_CODE(\@ret,$str); to be used. Why am I using _munge_array?!?
**Got it - needs work!** Get the full release of WashU Blast and make sure the
blast stuff works with it as well.
OK, rethink blast method(s) completely. Old and kludged and needs to die.
Add a method for handeling ncbi blast. CompBio::Simple should only have a blast
method that will DTRT and use the appropriate core methods. Need to revisit
only loading modules as needed, as this method should use the XML output
preferentially.
Stop using lowercase abiguity codes after complementing! (i.e. out of %AA)
If make install isn't putting right perl line in executibles, run perl config
myself?
Add DNA* and GCG format handelers
Add handler for the genbank report type format:
1 attgc gtgct
11 gtgtg gacaa
Which, though annoying, seems a certain candidate for recieving in cut and paste
operations, particularly through the web.
Better way to handle CPUSERVER. There must be some way to allow the user to
define (presumably during ./config?) how to submit jobs for more intensive
computational tools such as Blast and PIMA. Things like rsh, submitting to a
batch queue, etc should all be definable at install somehow.
write configure script to populate these globals as part of install
process.
modify tests to look for inclusion of Profile.pm or such and skip testing
where apropriate if the PIMA suite is not installed.
Find out if there is a way using OOP to allow a user to only need to include
this module and create new objects for the submodules, or even better just
DTRT. Can this be done just through inheritence? Should it be triggered by
requested export (like use CompBio qw(Simple DB)? Or through an AUTOLOAD
type interface, returning the correct object ($cbs = CompBio->Simple(``new'')
or some such)? I think that would be far more desirable than _having_ to
use a bunch of modules all in the CompBio namespace.
_error (all packages) needs to correctly report line where error occured.
Can this be done through caller or do I need to pass manually?
hmmm, how about a simple tm calculator for primers? I seem to be able to
find lots of them that work through web interface, or more complex primer
selection software, but none that plug and play to a perl program, and I'm
needing one. Not a primer selecter, just a _solid_ (more than just GC content),
straight foward Tm calculator.
Developed at the BioMolecular Engineering Research Center at Boston
University under the NHLBI's Programs for Genomic Applications grant.
Copyright Sean Quinlan, Trustees of Boston University 2000-2002.
All rights reserved. This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.
Sean Quinlan, seanq@darwin.bu.edu
Please email me with any changes you make or suggestions for changes/additions.
Latest version is available through SourceForge
http://sourceforge.net/projects/compbio/, or on the CPAN
http://www.cpan.org/authors/id/S/SE/SEANQ/.
Thank you!
I would like to thank the staff at the BMERC for being my guinee pigs for
the past few years, Jim Freeman who got me into this and left me with flint
and tinder rather than the ember in a clay pot he started with, and my boss
and (tor)mentor Dr. Temple F. Smith! I would never have gotten this far or
learned this much without them.
perl(1), CompBio::Simple(1).
Information
|
This site is currently in testing, it is not yet operating using the full database. Until it is officially launched you may wish to visit Help-Site Computer Manuals. After launch, this site (HelpSpy) will replace Help-Site. Information about the spider which is currently trawling the Internet looking for links to add to this directory can be found here. |
|