Article From:https://www.cnblogs.com/lokwongho/p/9682667.html

Chapter2 WHICH DNA PATTERNS PLAY THE ROLE OF MOLECULAR CLOCKS

Finding the order of modules

1.

Transcription factors regulate the transcriptional expression of genes by binding to specific sequences upstream of genes, but this sequence varies from individual to individual. This chapter describes greedy and stochastic algorithms for finding this sequence: finding the order of modules.

Two, some concepts:

1. Score、Profile The meaning is as follows

According to profile matrix, you can calculate a KMER in a pro.fileLower probability

Three.

Questions raised:Motif Finding Problem:

Given a collection of strings, find a set of k-mers, one from each string, that minimizes the score of the resulting motif.

Input: A collection of strings Dna and an integer k.

Output: A collection Motifs of k-mers, one from each string in Dna, minimizing Score(Motifs) among all possible choices of k-mers.

In a set of sequences, look for a group of k-mer whose Score is the lowest.（Or the minimum sum of Hamming distance from consensus sequence.

1 ergodic

```MedianString(Dna, k)
distance ← ∞
for each k-mer Pattern from AA…AA to TT…TT
if distance > d(Pattern, Dna)
distance ← d(Pattern, Dna)
Median ← Pattern
return Median```

2 Greedy method GreedyMotifSearch

```GREEDYMOTIFSEARCH(Dna, k, t)
BestMotifs ← motif matrix formed by first k-mers in each string
from Dna
for each k-mer Motif in the first string from Dna
Motif1 ← Motif
for i = 2 to t
form Profile from motifs Motif1, …, Motifi - 1
Motifi ← Profile-most probable k-mer in the i-th string
in Dna
Motifs ← (Motif1, …, Motift)
if Score(Motifs) < Score(BestMotifs)
BestMotifs ← Motifs
output BestMotifs```

Detailed explanation of http://www.mrgraeme.co.uk/greedy-motif-search/

*Greedy method GreedyMotifSearch with pseudocounts

pseudocounts：When profile matrix is formed, the 0 term is set to a smaller value.

```GreedyMotifSearch(Dna, k, t)
form a set of k-mers BestMotifs by selecting 1st k-mers in each string from Dna
for each k-mer Motif in the first string from Dna
Motif1 ← Motif
for i = 2 to t
apply Laplace's Rule of Succession to form Profile from motifs   Motif1, …, Motifi-1
Motifi ← Profile-most probable k-mer in the i-th string in Dna
Motifs ← (Motif1, …, Motift)
if Score(Motifs) < Score(BestMotifs)
BestMotifs ← Motifs
output BestMotifs```

3. Stochastic method Randomized Motif Search

`RandomizedMotifSearch(Dna, k, t)　　　　　#Randomly fetching k-mer from each DNA, generating a set of motifsRandomly select k-mers Motifs = (Motif1,... (Motift) in each stringFrom DnaBestMotifs MotifsWhile foreverProfile (Profile) (Motifs)Motifs forms Profile matrixMotifs (Motifs, Profile (Dna)) generates a set of motifs with the greatest probability from a set of DNA based on profile matrix.If Score (Motifs) < Score (BestMotifs)BestMotifs MotifsElseReturn BestMotifs`

The reason that stochastic algorithms work is that a randomly selected set of Motifs may pick a potentially correct k-mer and then form a tilt in it until a better solution is found.

Improvement: In the previous algorithm, a new set of Motifs is generated randomly at each iteration, which may discard the potential correct module order by changing only one line at a time at random

```GibbsSampler(Dna, k, t, N)
randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
BestMotifs ← Motifs
for j ← 1 to N
i ← Random(t)
Profile ← profile matrix constructed from all strings in Motifs except for Motif[i]
Motif[i] ← Profile-randomly generated k-mer in the i-th sequence
if Score(Motifs) < Score(BestMotifs)
BestMotifs ← Motifs
return BestMotifs```