Article From:https://www.cnblogs.com/lokwongho/p/9684131.html

Chapter1 WHERE IN THE GENOME DOES DNA REPLICATION BEGIN

1.

·Polymerase priming domains bind to sites in the upstream sequence that are multiple, specific, and distributed across two chains. The highest frequency of k-mer may be polymerase binding site: dnaA BOX.

But how to locate the approximate location of Ori?

·DNAThe asymmetry of chain replication, which leads to the asymmetry of mutation rate, leads to the trend of (forward chain C-> T, deamination). Thus, according to the increase of skew in the forward chain, the skew minus is in the reverse chain. (skew = G -C, every G+1 meets C-1) the lowest point in the graph represents the Ori region.

From this we can roughly infer the ori location, and then within this location (100 bp), look for patterns that appear frequently as possible DNA boxes.

·Since there are differences between bases in k-mer, a fault tolerant counting method should be used.

Two.

Put forward a questionThe Clump Finding Problem

Find every k-mer that forms a clump in the genome.

`ComputingFrequencies(Text, k) #A bucket method traversing the computation frequency onceFor I 0 to 4K 1FrequencyArray (I) 0For I to 0|Text| - KPattern Text (I, K)J PatternToNumber (Pattern) #hashFrequencyArray (J) FrequencyArray (J) + 1Return FrequencyArray`

ComputingFrequencies(Text, k)This is a method to calculate the frequency of KMER.

`FindingFrequentWordsBySorting(Text , k) #Sorting methodFrequentPatterns an empty setFor I to 0 |Text| KPattern TeXT (I, K)Index (I) PatternToNumber (Pattern)Count (I) 1SortedIndeX Sort (Index)For I to 1 |Text| KIf SortedIndex (I) = SortedIndex (I 1)Count (I) = Count (I + 1) + 1MaxCount maximum value in the array CountFor I to 0 |Text| KIf Count (I) = maxCountPattern NumberToPattErn (SortedIndex (I), K)Add Pattern to the set FrequentPatternsReturn FrequeNtPatterns`

FindingFrequentWordsBySorting(Text , k)This is another way to calculate the KMER frequency.

A fault-tolerant frequency calculation method
```ClumpFinding(Genome, k, L, t)
FrequentPatterns ← an empty set
for i ← 0 to 4k − 1
Clump(i) ← 0
for i ← 0 to |Genome| − L
Text ← the string of length L starting at position i in Genome
FrequencyArray ← ComputingFrequencies(Text, k)
for index ← 0 to 4k − 1
if FrequencyArray(index) ≥ t
Clump(index) ← 1
for i ← 0 to 4k − 1
if Clump(i) = 1
Pattern ← NumberToPattern(i, k)
add Pattern to the set FrequentPatterns
return FrequentPatterns```

We don’t have to recalculate the KMER frequency every time we move a search window. For every move in the search window, the first KMER will be one less, and the last KMER will be one more.

```BetterClumpFinding(Genome, k, t, L)
FrequentPatterns ← an empty set
for i ← 0 to 4k − 1
Clump(i) ← 0
Text ← Genome(0, L)
FrequencyArray ← ComputingFrequencies(Text, k)
for i ← 0 to 4k − 1
if FrequencyArray(i) ≥ t
Clump(i) ← 1
for i ← 1 to |Genome| − L
FirstPattern ← Genome(i − 1, k)
index ← PatternToNumber(FirstPattern)
FrequencyArray(index) ← FrequencyArray(index) − 1
LastPattern ← Genome(i + L − k, k)
index ← PatternToNumber(LastPattern)
FrequencyArray(index) ← FrequencyArray(index) + 1
if FrequencyArray(index) ≥ t
Clump(index) ← 1
for i ← 0 to 4k − 1
if Clump(i) = 1
Pattern ← NumberToPattern(i, k)
add Pattern to the set FrequentPatterns
return FrequentPatterns```