Application of software in biological drawing
Without proper data processing and analysis tools, researchers can be overwhelmed, especially if they lack training in programming, statistics, and modeling. Custom data analysis services are therefore becoming increasingly important in the biological sciences and can facilitate the research cycle.
Figure 1. Biological Data Analysis.
More drawing tools and geometry operations are provided to customize biological illustrations to meet your needs. EdrawMax is an advanced all-in-one diagramming tool for creating professional flowcharts, org charts, mind maps, network diagrams, UML diagrams, floor plans, electrical diagrams, science illustrations, and more. It includes easy-to-use biology diagram software for drawing biological diagrams and illustrations.
Start from high-quality biology diagram examples to facilitate your biology drawing.
It is well known that many protein-coding regions display codon bias. The nonuniform usage of codons results in different symbol statistics for different codon positions [ 12 ], and it is also a source of the period-3 property in the coding regions [ 13 ].
These properties are not observed in introns, which are not translated into amino acids. Therefore, it is important to incorporate these codon statistics when modeling protein-coding genes and building a gene-finder. The given HMM tries to capture the statistical differences between exons and introns. Each state $E_k$ uses a different set of emission probabilities to reflect the symbol statistics at the $k$-th position of a codon.
The state I is used to model the base statistics in introns. Note that this HMM can represent genes with multiple exons, where the respective exons can have variable number of codons, and the introns can also have variable lengths. This example shows that if we know the structure and the important characteristics of the biological sequences of interest, building the corresponding HMM is relatively simple and it can be done in an intuitive manner.
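As a rough illustration of the model described above, the following Python sketch wires up three exon states (one per codon position) and one intron state, then samples a sequence from them. The state names E1, E2, E3, and I follow the text; every probability value, the helper names `draw` and `generate`, and the choice to start in E1 are assumptions made only for this example.

```python
import random

STATES = ["E1", "E2", "E3", "I"]

# Hypothetical transition probabilities (illustrative values only).
# A codon is always completed (E1 -> E2 -> E3); after E3 the model
# either starts another codon or enters an intron.
TRANS = {
    "E1": {"E2": 1.0},
    "E2": {"E3": 1.0},
    "E3": {"E1": 0.9, "I": 0.1},
    "I": {"I": 0.9, "E1": 0.1},
}

# Hypothetical emission probabilities: each codon position has its own
# base statistics, and the intron state has a separate distribution.
EMIT = {
    "E1": {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},
    "E2": {"A": 0.30, "C": 0.25, "G": 0.20, "T": 0.25},
    "E3": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "I": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}


def draw(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(length, rng, start="E1"):
    """Generate `length` symbols together with the hidden state path."""
    state, states, symbols = start, [], []
    for _ in range(length):
        states.append(state)
        symbols.append(draw(EMIT[state], rng))
        state = draw(TRANS[state], rng)
    return "".join(symbols), states


seq, path = generate(30, random.Random(0))
print(seq)
print(" ".join(path))
```

Because E3 can loop back to E1 and I can loop on itself, the sketch allows exons with any number of codons and introns of any length, which is the property the text highlights.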
The constructed HMM can now be used to analyze new observation sequences. For example, given a new DNA sequence x, how can we find out whether it is a protein-coding gene? Or, if we assume that x is a protein-coding gene, how can we predict the locations of the exons and introns in the given sequence? We can answer the first question by computing the observation probability of x based on the given HMM that models coding genes. If this probability is high, it implies that this DNA sequence is likely to be a coding gene. Otherwise, we may conclude that x is unlikely to be a coding gene, since it does not exhibit the statistical properties that are typically observed in protein-coding genes.
The second question is about predicting the internal structure of the sequence, as it cannot be directly observed. To answer this question, we may first predict the state sequence y in the HMM that best describes x. Once we have inferred the best y, it is straightforward to predict the locations of the exons and introns. For example, assume that the optimal state sequence y is as shown in the corresponding figure. This implies that the first nine bases x ...
As these examples show, HMMs provide a formal probabilistic framework for analyzing biological sequences.
There are three basic problems that have to be addressed in order to use HMMs in practical applications: the scoring problem, the alignment problem, and the training problem. The scoring problem asks for the observation probability of a symbol sequence x under a given HMM. Note that for a given x, its underlying state sequence is not directly observable, and there can be many state sequences that yield x. Therefore, one way to compute the observation probability is to consider all possible state sequences y for the given x and sum up the probabilities as follows: $P(x) = \sum_{y} P(x, y)$.
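However, this is computationally very expensive, since there are $M^L$ possible state sequences, where $M$ is the number of states and $L$ is the length of x. Instead of enumerating all possible state sequences, the forward algorithm defines the following forward variable: $\alpha(n, i) = P(x_1 \cdots x_n, y_n = i)$, which can be computed recursively for $n = 1, \ldots, L$.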
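The contrast between exhaustive enumeration and the forward recursion can be sketched in Python as follows. The two-state model, its probabilities, the initial distribution `INIT`, and the function names are all assumptions chosen for illustration; the model has no explicit end state, so the observation probability is taken as the sum of the final forward variables.

```python
from itertools import product

# Hypothetical two-state HMM (values are illustrative only).
STATES = ["E", "I"]
INIT = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.2, "I": 0.8}}
EMIT = {
    "E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def brute_force_probability(x):
    """Sum P(x, y) over all M**L state sequences y (exponential cost)."""
    total = 0.0
    for y in product(STATES, repeat=len(x)):
        p = INIT[y[0]] * EMIT[y[0]][x[0]]
        for n in range(1, len(x)):
            p *= TRANS[y[n - 1]][y[n]] * EMIT[y[n]][x[n]]
        total += p
    return total


def forward_probability(x):
    """Compute P(x) with the forward recursion in O(L * M**2) steps."""
    # alpha[i] holds P(x_1..x_n, y_n = i) for the current position n.
    alpha = {i: INIT[i] * EMIT[i][x[0]] for i in STATES}
    for symbol in x[1:]:
        alpha = {
            i: EMIT[i][symbol] * sum(alpha[k] * TRANS[k][i] for k in STATES)
            for i in STATES
        }
    return sum(alpha.values())


x = "ATGCGA"
print(brute_force_probability(x), forward_probability(x))
```

For long sequences the products underflow in floating point, so practical implementations usually rescale the forward variables or work with logarithms; that refinement is omitted here to keep the recursion visible.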
The forward algorithm computes the observation probability of x with only $O(LM^2)$ computations. Therefore, the amount of time required for computing the probability increases only linearly with the sequence length $L$, instead of increasing exponentially. Another practically important problem is to find the optimal state sequence, or the optimal path, in the HMM that maximizes the observation probability of the given symbol sequence x. Among all possible state sequences y, we want to find the state sequence that best explains the observed symbol sequence.
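This can be viewed as finding the best alignment between the symbol sequence and the HMM, hence it is sometimes called the optimal alignment problem. The Viterbi algorithm defines the variable $\gamma(n, i) = \max_{y_1 \cdots y_{n-1}} P(x_1 \cdots x_n, y_1 \cdots y_{n-1}, y_n = i)$, the probability of the most likely path that ends in state $i$ after emitting the first $n$ symbols.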
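A minimal Python sketch of the Viterbi recursion is given below. It works with log-probabilities to avoid underflow and keeps back-pointers so the best path can be traced back. The two-state model, its parameter values, and the initial distribution `INIT` are assumptions for illustration, and all probabilities are kept strictly positive so that the logarithms are defined.

```python
import math

# Hypothetical two-state HMM (values are illustrative only).
STATES = ["E", "I"]
INIT = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.2, "I": 0.8}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(x):
    """Return the most probable state path for x and its log-probability."""
    # score[i] is the log-probability of the best path ending in state i.
    score = {i: math.log(INIT[i]) + math.log(EMIT[i][x[0]]) for i in STATES}
    back = []  # back[n][i] = best predecessor of state i at step n + 1
    for symbol in x[1:]:
        new_score, pointers = {}, {}
        for i in STATES:
            best_k = max(STATES, key=lambda k: score[k] + math.log(TRANS[k][i]))
            new_score[i] = (
                score[best_k]
                + math.log(TRANS[best_k][i])
                + math.log(EMIT[i][symbol])
            )
            pointers[i] = best_k
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    last = max(STATES, key=lambda i: score[i])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


best_path, log_p = viterbi("GCGCATAT")
print("".join(best_path), log_p)
```

The loop visits each of the L positions once and, for each of the M states, scans M predecessors, which is where the quadratic dependence on the number of states comes from.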
Like the forward algorithm, the Viterbi algorithm finds the optimal state sequence in $O(LM^2)$ time. As we have seen, the Viterbi algorithm finds the optimal path that maximizes the observation probability of the entire symbol sequence.
In some cases, it may be more useful to find the optimal states individually for each symbol position. In this case, we can find the optimal state $y_n$ that is most likely to be the underlying state of $x_n$ as follows: $\hat{y}_n = \arg\max_i P(y_n = i \mid x)$. The advantage of predicting the optimal states individually is that this approach will maximize the expected number of correctly predicted states.
However, the sequence of individually optimal states is generally not the most probable path as a whole, and it may not even be a valid path in the HMM. For this reason, the Viterbi algorithm is often preferred when we are interested in inferring the optimal state sequence for the entire observation x, while the posterior-decoding approach in (14) is preferred when our interest is mainly in predicting the optimal state at a specific position. The posterior probability in (15) can also be useful for estimating the reliability of a state prediction.
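Posterior decoding can be sketched in Python by combining forward and backward variables. The backward variable and the identity P(y_n = i | x) = alpha(n, i) * beta(n, i) / P(x) are the standard forward-backward construction rather than something spelled out in the text above; the two-state model, its values, and the function names are assumptions for illustration.

```python
# Hypothetical two-state HMM (values are illustrative only).
STATES = ["E", "I"]
INIT = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.2, "I": 0.8}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def forward(x):
    """alpha[n][i] = P(x_1..x_n, y_n = i), with 0-based n in the list."""
    alpha = [{i: INIT[i] * EMIT[i][x[0]] for i in STATES}]
    for s in x[1:]:
        prev = alpha[-1]
        alpha.append(
            {i: EMIT[i][s] * sum(prev[k] * TRANS[k][i] for k in STATES)
             for i in STATES}
        )
    return alpha


def backward(x):
    """beta[n][i] = P(x_{n+1}..x_L | y_n = i), with 0-based n in the list."""
    beta = [{i: 1.0 for i in STATES}]
    for s in reversed(x[1:]):
        nxt = beta[0]
        beta.insert(
            0,
            {i: sum(TRANS[i][k] * EMIT[k][s] * nxt[k] for k in STATES)
             for i in STATES},
        )
    return beta


def posterior_decode(x):
    """Return the individually most probable states and their posteriors."""
    alpha, beta = forward(x), backward(x)
    p_x = sum(alpha[-1].values())
    posteriors = [
        {i: alpha[n][i] * beta[n][i] / p_x for i in STATES}
        for n in range(len(x))
    ]
    states = [max(STATES, key=lambda i: post[i]) for post in posteriors]
    return states, posteriors


states, posteriors = posterior_decode("GCGCATAT")
for state, post in zip(states, posteriors):
    print(state, round(post[state], 3))
```

The printed posterior next to each predicted state is the reliability estimate mentioned in the text: values near 1 indicate a confident call, values near 0.5 an ambiguous one.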
The scoring problem and the alignment problem are concerned with analyzing a new observation sequence x based on the given HMM. However, the solutions to these problems are meaningful only if the HMM can properly represent the sequences of interest. Suppose we have a set of related observation sequences that we want to model. For example, they may be different speech recordings of the same word or protein sequences that belong to the same functional family. Now, the important question is how we can reasonably choose the HMM parameters based on these observations.
This is typically called the training problem. Although there is no optimal way of estimating the parameters from a limited number of finite observation sequences, there are ways to find the HMM parameters that locally maximize the observation probability [ 1 , 16 - 18 ]. A widely used example is the Baum-Welch algorithm, an expectation-maximization (EM) procedure. Since the estimation of the HMM parameters is essentially an optimization problem, we can also use standard gradient-based techniques to find the optimal parameters of the HMM [ 17 , 18 ].
It has been demonstrated that the gradient-based method can yield good estimation results that are comparable to those of the popular EM-based method [ 18 ]. When the precise evaluation of the probability or likelihood of an observation is practically intractable for the HMM at hand, we may use simulation-based techniques to evaluate it approximately [ 17 , 19 ]. These techniques allow us to handle a much broader class of HMMs.
There are also training methods based on stochastic optimization algorithms, such as simulated annealing, that try to improve the optimization results by avoiding local maxima [ 20 , 21 ]. Currently, there exists a vast literature on estimating the parameters of hidden Markov models, and the reader is referred to [ 1 , 17 , 19 , 22 , 23 ] for further discussions.
There exist a large number of HMM variants that modify and extend the basic model to meet the needs of various applications. For example, we can add silent states (i.e., states that do not emit any symbol). We can also make the states emit two aligned symbols, instead of a single symbol, so that the resulting HMM simultaneously generates two related symbol sequences [ 3 , 4 , 26 ].
It is also possible to make the probabilities at certain states dependent on part of the previous emissions [ 9 , 27 ] so that we can describe more complex symbol correlations. In the following sections, we review a number of HMM variants that have been used in various biological sequence analysis problems.
Let us assume that we have a multiple sequence alignment of proteins or DNA sequences that belong to the same functional family. How can we build an HMM that can effectively represent the common patterns, motifs, and other statistical properties in the given alignment? One model that is especially useful for representing the profile of a multiple sequence alignment is the profile hidden Markov model (profile-HMM) [ 24 , 25 ].
A profile-HMM repeatedly uses three types of hidden states, namely, match states $M_k$, insert states $I_k$, and delete states $D_k$, to describe position-specific symbol frequencies, symbol insertions, and symbol deletions, respectively. To see how profile-HMMs work, let us consider the following example. Suppose we want to construct a profile-HMM based on the multiple alignment shown in the accompanying figure (captioned "Profile hidden Markov model"). As we can see, the given alignment has five columns, where the base frequencies in the respective columns are different from each other.
The $k$-th match state $M_k$ in the profile-HMM is used to describe the symbol frequencies in the $k$-th column of the alignment. As a result, the number of match states in the resulting profile-HMM is identical to the length of the consensus sequence.
The emission probability $e(x \mid M_k)$ at the $k$-th match state $M_k$ reflects the observed symbol frequencies in the $k$-th consensus column. By interconnecting the match states $M_1, M_2, \ldots$, we obtain an ungapped HMM. This ungapped HMM can represent DNA sequences that match the consensus sequence of the alignment without any gap, and it serves as the backbone of the final profile-HMM that is to be constructed.
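To make the ungapped backbone concrete, the Python sketch below derives one emission distribution per alignment column and scores a sequence against the resulting chain of match states. The gap-free five-column alignment and the function names are hypothetical, and emissions are plain relative frequencies, so a symbol never seen in a column gets probability zero.

```python
from collections import Counter

ALPHABET = "ACGT"

# Hypothetical gap-free alignment with five columns (illustration only).
ALIGNMENT = ["ACGTA", "ACGTT", "ATGTA", "ACGCA"]


def match_emissions(alignment):
    """Return one emission distribution per column, i.e. per match state."""
    emissions = []
    for column in zip(*alignment):
        counts = Counter(column)
        emissions.append({s: counts[s] / len(column) for s in ALPHABET})
    return emissions


def ungapped_probability(sequence, emissions):
    """Probability of a sequence under the chain M_1 -> M_2 -> ... -> M_K.

    Every transition M_k -> M_{k+1} has probability 1, so only the
    emission probabilities contribute to the product.
    """
    if len(sequence) != len(emissions):
        raise ValueError("an ungapped model only scores sequences of length K")
    p = 1.0
    for symbol, dist in zip(sequence, emissions):
        p *= dist[symbol]
    return p


emissions = match_emissions(ALIGNMENT)
print(ungapped_probability("ACGTA", emissions))
print(ungapped_probability("ATGCT", emissions))
```

The length check in `ungapped_probability` shows the limitation of this backbone: it cannot score sequences that are longer or shorter than the consensus.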
Once we have constructed the ungapped HMM, we add insert states $I_k$ and delete states $D_k$ to the model so that we can account for insertions and deletions in new observation sequences. Let us first consider the case when the observed DNA sequence is longer than the consensus sequence of the original alignment. In this case, if we align these sequences, there will be one or more bases in the observed DNA sequence that are not present in the consensus sequence.
These additional symbols are modeled by the insert states. Now, let us consider the case when the new observed sequence is shorter than the consensus sequence. In this case, there will be one or more bases in the consensus sequence that are not present in the observed DNA sequence.
The $k$-th delete state $D_k$ is used to handle the deletion of the $k$-th symbol in the original consensus sequence. As delete states represent symbols that are missing, $D_k$ is a non-emitting state, or a silent state, which is simply used as a place-holder that interconnects the neighboring states. After adding the insert states and the delete states to the ungapped HMM, we obtain the final profile-HMM. Estimating the parameters of a profile-HMM based on a given multiple sequence alignment is relatively simple.
We first have to decide which columns should be represented by match states and which columns should be modeled by insert states. Suppose we have a column that contains one or more gaps.
One simple rule would be to compare the number of symbols and the number of gaps. If the column has more symbols than gaps, we treat the gaps as symbol deletions. Therefore, we model the column using a match state $M_k$ for the symbols in the given column and a delete state $D_k$ for the gaps in the same column.
Conversely, if we have more gaps than symbols, it would make more sense to view the symbols as insertions, hence we use an insert state $I_k$ to represent the column.
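The column rule above can be written as a few lines of Python. The alignment and the function name are hypothetical, and the text does not say what to do when a column has equally many symbols and gaps; this sketch treats such a tie as an insert column, which is an assumption.

```python
# Hypothetical alignment with gaps (illustration only).
ALIGNMENT = [
    "AC-GT",
    "ACAGT",
    "A--GT",
    "TC-GA",
]


def classify_columns(alignment, gap="-"):
    """Label each column 'match' (more symbols than gaps) or 'insert'."""
    labels = []
    for column in zip(*alignment):
        gaps = column.count(gap)
        symbols = len(column) - gaps
        # Ties fall through to 'insert' (an assumption, see lead-in).
        labels.append("match" if symbols > gaps else "insert")
    return labels


print(classify_columns(ALIGNMENT))
```

In this example the third column holds one symbol and three gaps, so it is the only one modeled by an insert state; the single gap in the second column is treated as a deletion.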
Once we have decided which columns should be represented by match states and which ones should be represented by insert states, we know the underlying state sequence for each symbol sequence in the alignment. Therefore, we can estimate the transition probabilities and the emission probabilities of the profile-HMM by counting the number of each state transition or symbol emission and computing their relative frequencies.
To allow small probability for state transitions or symbol emissions that are not observed in the original alignment, we can add the so-called pseudocounts to the actual counts [ 3 ]. We can also use more sophisticated methods for parameterizing the profile-HMMs. In fact, there have been considerable research efforts for optimal construction and parameterization of profile-HMMs to improve their overall performance.
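The counting procedure with pseudocounts can be sketched in Python as follows. The alignment, the begin and end labels `B` and `E`, the indexing of insert states ($I_k$ follows match column $k$, with $I_0$ before the first), the set of allowed transitions, and the pseudocount value of 1 are all assumptions for illustration. Only transition probabilities are estimated here; emission probabilities could be counted in the same way.

```python
from collections import Counter, defaultdict

# Hypothetical alignment with gaps (illustration only).
ALIGNMENT = ["AC-GT", "ACAGT", "A--GT", "TC-GA"]


def classify_columns(alignment, gap="-"):
    """'match' if a column has more symbols than gaps, else 'insert'."""
    return [
        "match" if len(col) - col.count(gap) > col.count(gap) else "insert"
        for col in zip(*alignment)
    ]


def state_path(row, labels, gap="-"):
    """State sequence that one aligned row follows through the profile-HMM."""
    path, k = ["B"], 0
    for symbol, label in zip(row, labels):
        if label == "match":
            k += 1
            path.append(f"M{k}" if symbol != gap else f"D{k}")
        elif symbol != gap:
            path.append(f"I{k}")  # insertion after match column k
    path.append("E")
    return path


def allowed_targets(state, n_match):
    """Successors of a state in the assumed profile-HMM topology."""
    k = 0 if state == "B" else int(state[1:])
    if k < n_match:
        return [f"I{k}", f"M{k + 1}", f"D{k + 1}"]
    return [f"I{k}", "E"]


def estimate_transitions(alignment, labels, pseudocount=1.0):
    """Relative transition frequencies with a pseudocount on each target."""
    n_match = labels.count("match")
    counts = defaultdict(Counter)
    for row in alignment:
        path = state_path(row, labels)
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    sources = ["B", "I0"] + [
        f"{t}{k}" for k in range(1, n_match + 1) for t in "MID"
    ]
    probs = {}
    for src in sources:
        targets = allowed_targets(src, n_match)
        total = sum(counts[src][t] for t in targets) + pseudocount * len(targets)
        probs[src] = {t: (counts[src][t] + pseudocount) / total for t in targets}
    return probs


labels = classify_columns(ALIGNMENT)
probs = estimate_transitions(ALIGNMENT, labels)
print(state_path(ALIGNMENT[1], labels))
print(probs["M1"])
```

With the pseudocount, transitions that never occur in the alignment (for example from $M_1$ to $I_1$ here) still receive a small nonzero probability, so a new sequence that uses them is not scored as impossible.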
More discussions on this topic can be found in [ 3 , 28 - 32 ]. Due to the convenience and effectiveness in representing sequence profiles, profile-HMMs have been widely used for modeling and analyzing biological sequences. When profile-HMMs were first proposed, they were quickly adopted for modeling the characteristics of a number of protein families, such as globins, immunoglobulins, and kinases [ 33 ].
They have been shown to be useful for various tasks, including protein classification, motif detection, and finding multiple sequence alignments. Publicly available software packages, such as HMMER and SAM, implement profile-HMMs. These packages provide convenient tools for applying profile-HMMs to various sequence analysis problems.
A comparison between these two popular HMM packages and an assessment of their critical features can be found in [ 32 ]. It would also be very convenient to have a library of ready-made profile-HMMs for known sequence families.
Given a profile-HMM that represents a biological sequence family, we can use it to search a sequence database to find additional homologues that belong to the same family. In a similar manner, if we have a database of pre-built profile-HMMs, we can use a single query sequence to search through the database to look for matching profiles.
This strategy can be used for classification and annotation of the given sequence. For example, by querying a new protein sequence against Pfam or PROSITE, we can find out whether the sequence contains any of the known protein domains.
Chemistry Drawing Software
The vector stencils library "Laboratory equipment" contains 31 clipart icons of chemical laboratory equipment and labware for drawing part assembly and mounting schemes of glassware apparatus in chemical experiment diagrams and illustrations.
Glass use in laboratory applications is not as commonplace as it once was because of cheaper, less breakable, plasticware; however, certain applications still require glassware because glass is relatively inert, transparent, heat-resistant, and easy to customize. The type of glass used is dependent on the application. Borosilicate glass, which is commonly used in reagent bottles, can withstand thermal stress.