Python for genomics and next-generation sequencing

[Banner image: genome alignment]

It's no secret that we're huge fans of Python here at Amber Biology. At least 90% of all our consulting projects involve some Python coding and it's such a versatile, productive and expressive language that we like to call it "The Swiss Army Knife of programming languages".

One of the things we commonly hear about Python, however, goes something like this:

"They say that Python is too slow and inefficient to use for heavy computational tasks because it's an interpreted language."

Is this the same "they" that also say you can develop extraordinary night vision by eating lots of carrots?

At Amber Biology we have used Python for many computationally intensive research problems, for example, simulating the use of a novel next-generation sequencing laboratory protocol on the human genome.

Yes, the standard Python distribution may not always be as fast as compiled languages like C, but there are many ways of using Python that can raise its performance to a level that is at least in the same ballpark as these compiled languages, if not equal to them.

Python modules can be written not only in Python itself, but also in other languages such as C, compiled to machine-executable instructions and linked to the Python interpreter. This means that your Python code can be running as interpreted code, or directly as machine-executable instructions - all from within the same application.

The official Python interpreter is actually written in C, as are many of the modules that ship with it in the Python Standard Library. When you call these standard library modules from code running in the Python interpreter, their functions and methods execute directly (and very efficiently) as compiled, machine-executable instructions for as long as they can proceed without referring back to anything in the interpreter loop. Essentially, if you structure your code well, you can get something akin to compiled-language performance, even from an "interpreted" language like Python.
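If you want to see this effect for yourself, here is a minimal (and admittedly contrived) sketch - assuming a reasonably recent Python 3 - that times a pure Python loop counting the 'a' bases in a ten-million-base string against the C-implemented str.count() method. On most machines the built-in method wins by a very wide margin, because all of the counting happens in compiled code.

import random
import time

# Make a ten-million-base test string (purely illustrative data)
seq = ''.join(random.choice('atcg') for n in range(10000000))

# Pure Python loop - every iteration runs through the interpreter
t1 = time.time()
count = 0
for base in seq:
    if base == 'a':
        count += 1
t2 = time.time()
print('Python loop:', count, 'bases in', (t2 - t1), 'seconds')

# str.count() is written in C, so the counting runs as compiled code
t1 = time.time()
count = seq.count('a')
t2 = time.time()
print('str.count():', count, 'bases in', (t2 - t1), 'seconds')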

Don't believe us?

Researchers working in the fields of next-generation sequencing and genomics routinely work with sequences that number in the millions or even billions of bases. As an example of how rapidly and efficiently Python can handle these kinds of problems, let's take the largest contiguous piece of the human genome, Chromosome 1, and use Python to search its approximately 250 million bases for potential gene promoter sequences.

Gene promoters often consist of relatively conserved recognition sequences separated by a spacer sequence, frequently of variable length. Many bacterial gene promoters, for example, consist of two distinct recognition sequences upstream of the gene to be transcribed, like this:

 5' ----PPPPPP-------------------PPPPPP----GGGGGGGGGGG … 3'
        |                        |         |
       -35                      -10        start of gene

The consensus sequence for the recognition site at the -10 position is tataat while the -35 site has a consensus sequence of ttgaca. We can quite easily craft a Python regular expression that would search a genome for such consensus promoter regions like this:

promoter = 'ttgaca...................tataat'

This would match any sequences that consisted of exactly the two consensus recognition sites separated by a fixed number of bases (with any of the 4 bases being allowed at those positions).
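Just to check that this idea behaves as advertised, here is a tiny, self-contained test on a made-up sequence (the test sequence, the flanking bases and the 19-base spacer length used here are all invented purely for illustration):

import re

# A fixed-spacer consensus pattern: ttgaca, then 19 "any base" positions, then tataat
promoter = 'ttgaca' + '.' * 19 + 'tataat'

# A made-up test sequence with the two sites exactly 19 bases apart
test_sequence = 'ccgt' + 'ttgaca' + 'a' * 19 + 'tataat' + 'ggct'

match = re.search(promoter, test_sequence)
if match:
    print('Match found:', match.group())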

But what if the spacer between the consensus sequences is of variable length?

It turns out that regular expressions give us a simple and powerful way to deal with this situation as well. We can create search patterns that allow a variable number of bases between the recognition sites like this:

promoter = 'ttgaca.{15,25}tataat'

The curly brackets after the period are used to indicate a repeat of between 15 and 25 occurrences of the character preceding the brackets - in this case the period that matches any base. This search pattern would match any stretch of the genome consisting of the two consensus recognition sites, separated by between 15 and 25 bases of any kind.
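To see the benefit of the flexible spacer, here is another small made-up example: the same pattern matched against two sequences whose spacers are 16 and 24 bases long respectively (both lengths chosen arbitrarily for illustration).

import re

promoter = 'ttgaca.{15,25}tataat'

# Two made-up sequences with spacers of different lengths
for spacer_length in (16, 24):
    test_sequence = 'gg' + 'ttgaca' + 'c' * spacer_length + 'tataat' + 'aa'
    match = re.search(promoter, test_sequence)
    print(spacer_length, 'base spacer:', match.group() if match else 'no match')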

So now that we have a gene promoter pattern to search for, let's use Python to generate a synthetic Chromosome 1 - especially since this is just a computational performance test, and it saves us the time of tracking down a Chromosome 1 file from an online human genome repository and writing the code to read and parse it. To create our synthetic chromosome, we'll simply use the Python random module to generate a randomized sequence of 250 million bases, like this:

import random

# Build a random 250 million base sequence, one base at a time
bases = ['a', 't', 'c', 'g']
sequence_list = []
for n in range(250000000):
    sequence_list.append(random.choice(bases))
chromosome = ''.join(sequence_list)

This actually takes a few minutes on my iMac - much (much) longer than the time it will take to search this synthetic chromosome for sequences that match our consensus gene promoter pattern, as we shall see in a minute ...
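As an aside, if waiting a few minutes bothers you, there are quicker ways to build this kind of sequence. For example (assuming Python 3.6 or later, where random.choices() is available), the snippet below draws all of the bases in a single call and typically runs considerably faster than appending them one at a time:

import random

bases = ['a', 't', 'c', 'g']

# random.choices() (Python 3.6+) draws all 250 million bases in one call
chromosome = ''.join(random.choices(bases, k=250000000))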

So for this performance test we will need the Python time module, which, as the name implies, handles time. The time.time() function returns the number of seconds since an arbitrary fixed date and time, referred to in the Python documentation as the start of the epoch. This all sounds very prehistoric and/or apocalyptic, but you might be surprised to learn that this grand dawn of history is actually January 1st, 1970.

Really?

This has to do with the fact that the standard approach to counting time on computers was defined by the engineers of Unix, one of the earliest and most influential operating systems, and it has since been inherited by virtually all other computer platforms as a kind of de facto standard for time.
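You can confirm this directly from the Python prompt - time.gmtime(0) converts zero "seconds since the epoch" back into a calendar date, and sure enough, it lands on January 1st 1970:

import time

print(time.time())     # seconds elapsed since the epoch, right now
print(time.gmtime(0))  # the epoch itself: tm_year=1970, tm_mon=1, tm_mday=1, ...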

And since we are going to use Python regular expressions, we will also need to import the Python re module, from which we will use the wonderfully versatile re.finditer() function to search for our consensus gene promoter sequences. Instead of just returning the matching sequences, finditer() returns an iterable collection of Python match objects that allows the search results to be analyzed in more detail (for example, the start and end positions of the matching sequences are also recorded).
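If that distinction is not obvious, here is a quick made-up example contrasting re.findall(), which returns only the matching strings, with re.finditer(), whose match objects also carry the coordinates of each match (the test sequence and its 17-base spacer are invented purely for illustration):

import re

promoter = 'ttgaca.{15,25}tataat'

# A made-up sequence containing a single promoter-like site with a 17-base spacer
sequence = 'gg' + 'ttgaca' + 'a' * 17 + 'tataat' + 'cc'

print(re.findall(promoter, sequence))        # just the matching strings

for match in re.finditer(promoter, sequence):
    print(match.start(), match.end(), match.group())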

When we put it all together, the code to search our synthetic Chromosome 1 for consensus gene promoter sequences looks like this:

import re
import time

# Flexible promoter pattern: the two consensus sites separated by 15-25 bases
promoter = 'ttgaca.{15,25}tataat'

t1 = time.time()
result = re.finditer(promoter, chromosome)
t2 = time.time()
print('Search time was', (t2 - t1), 'seconds')

So let's run this Python code and see what we get ...

Search time was 0.00100994110107 seconds

Wow!

On my modest iMac desktop (circa 2014), it takes Python only about one millisecond to search the 250 million base sequence of Chromosome 1 for this consensus gene promoter pattern!

By extrapolation, therefore - and excluding the time it would take to load and read the individual chromosome sequence files of the human genome - we could expect it to take only about 15 milliseconds to search its entire 3 billion bases (roughly twelve times the size of Chromosome 1) for this gene promoter pattern!

By the way - just in case you were wondering whether it was fast only because there was nothing for it to find, let's add a little additional code to print out the details of the matches found. We will use the data stored in each Python match object to print the start and end points of each matching sequence, as well as the matching sequence itself (and to save space, we will only show the first 3 and last 3 matches here - but you get the idea). Finally, we will also print the total number of matching gene promoter sequences that were found in our synthetic chromosome.

# Walk through the match objects, printing the coordinates and the matched sequence
nmatches = 0
for match in result:
    nmatches += 1
    print(match.start(), match.end(), match.group())
print('Number of search hits = ', nmatches)

And when we include this additional code, we get this ...

1199566 1199603 ttgacactcacatcatcagagccccacatagtataat
2103278 2103308 ttgacacacacagggtttgtgatttataat
3702112 3702141 ttgacactctttcaaaccaggactataat
 …
 …

245627316 245627350 ttgacaaggtctccgtggccccggctattataat
246256184 246256220 ttgacaggattcctctcgttaattacatcgtataat
248653641 248653674 ttgacaaccgggctcgtaacgtattagtataat
Number of search hits =  185

So there you have it.

Next time somebody tells you that Python is too slow for heavy computation because it's an interpreted language - just politely tell them that you've heard of a bridge in New York State that's up for sale at a bargain price!

© 2018, Amber Biology


Amber Biology offers a unique training course in Python programming, tailored specifically for life science researchers with little or no prior coding experience. Please contact us if you would like to have this course taught on-site at your organization.

Gordon Webster