Article From:https://www.cnblogs.com/lokwongho/p/9684131.html

Chapter1 WHERE IN THE GENOME DOES DNA REPLICATION BEGIN 

 

 

1.

·Polymerase priming domains bind to sites in the upstream sequence that are multiple, specific, and distributed across two chains. The highest frequency of k-mer may be polymerase binding site: dnaA BOX.

But how to locate the approximate location of Ori?

·DNAThe asymmetry of chain replication, which leads to the asymmetry of mutation rate, leads to the trend of (forward chain C-> T, deamination). Thus, according to the increase of skew in the forward chain, the skew minus is in the reverse chain. (skew = G -C, every G+1 meets C-1) the lowest point in the graph represents the Ori region.

From this we can roughly infer the ori location, and then within this location (100 bp), look for patterns that appear frequently as possible DNA boxes.

·Since there are differences between bases in k-mer, a fault tolerant counting method should be used.

 

E. coli G-C skew

Two.

Put forward a questionThe Clump Finding Problem 

Find every k-mer that forms a clump in the genome. 

ComputingFrequencies(Text, k) #A bucket method traversing the computation frequency onceFor I 0 to 4K 1FrequencyArray (I) 0For I to 0|Text| - KPattern Text (I, K)J PatternToNumber (Pattern) #hashFrequencyArray (J) FrequencyArray (J) + 1Return FrequencyArray

ComputingFrequencies(Text, k)This is a method to calculate the frequency of KMER.

FindingFrequentWordsBySorting(Text , k) #Sorting methodFrequentPatterns an empty setFor I to 0 |Text| KPattern TeXT (I, K)Index (I) PatternToNumber (Pattern)Count (I) 1SortedIndeX Sort (Index)For I to 1 |Text| KIf SortedIndex (I) = SortedIndex (I 1)Count (I) = Count (I + 1) + 1MaxCount maximum value in the array CountFor I to 0 |Text| KIf Count (I) = maxCountPattern NumberToPattErn (SortedIndex (I), K)Add Pattern to the set FrequentPatternsReturn FrequeNtPatterns

FindingFrequentWordsBySorting(Text , k)This is another way to calculate the KMER frequency.

A fault-tolerant frequency calculation method
ClumpFinding(Genome, k, L, t)
        FrequentPatterns ← an empty set
        for i ← 0 to 4k − 1
            Clump(i) ← 0
        for i ← 0 to |Genome| − L
            Text ← the string of length L starting at position i in Genome 
            FrequencyArray ← ComputingFrequencies(Text, k)
            for index ← 0 to 4k − 1
                if FrequencyArray(index) ≥ t
                    Clump(index) ← 1
        for i ← 0 to 4k − 1
            if Clump(i) = 1
                Pattern ← NumberToPattern(i, k)
                add Pattern to the set FrequentPatterns
        return FrequentPatterns

We don’t have to recalculate the KMER frequency every time we move a search window. For every move in the search window, the first KMER will be one less, and the last KMER will be one more.

BetterClumpFinding(Genome, k, t, L)
        FrequentPatterns ← an empty set
        for i ← 0 to 4k − 1
            Clump(i) ← 0
        Text ← Genome(0, L)
        FrequencyArray ← ComputingFrequencies(Text, k)
        for i ← 0 to 4k − 1
            if FrequencyArray(i) ≥ t
                Clump(i) ← 1
        for i ← 1 to |Genome| − L
            FirstPattern ← Genome(i − 1, k)
            index ← PatternToNumber(FirstPattern)
            FrequencyArray(index) ← FrequencyArray(index) − 1
            LastPattern ← Genome(i + L − k, k)
            index ← PatternToNumber(LastPattern)
            FrequencyArray(index) ← FrequencyArray(index) + 1
            if FrequencyArray(index) ≥ t
                Clump(index) ← 1
        for i ← 0 to 4k − 1
            if Clump(i) = 1
                Pattern ← NumberToPattern(i, k)
                add Pattern to the set FrequentPatterns
        return FrequentPatterns

 

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *