Skip to content

Commit

Permalink
Added exercise 2 solutions.
Browse files Browse the repository at this point in the history
  • Loading branch information
justinbois committed Jun 20, 2024
1 parent c39ca20 commit c0fa045
Show file tree
Hide file tree
Showing 123 changed files with 22,266 additions and 4,689 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,262 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 2.1: Parsing a FASTA file\n",
"\n",
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are packages, like [Biopython](http://biopython.org/) and [scikit-bio](http://scikit-bio.org) for processing files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills. \n",
"\n",
"**a)** Use command line tools to investigate the [FASTA file](https://en.wikipedia.org/wiki/FASTA_format) located at `~/git/bootcamp/data/salmonella_spi1_region.fna`. This file contains a portion of the _Salmonella_ genome (described in [Exercise 4.1](exercise_2.3.ipynb)).\n",
"\n",
"You will notice that the first line begins with a `>`, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.\n",
"\n",
"**b)** The format of the _Salmonella_ SPI1 region FASTA file is a common format for such files (though oftentimes FASTA files contain multiple sequences). Use the file I/O skills you have learned to write a function to read in a sequence from a FASTA file containing a single sequence (but possibly having the first line in the file beginning with `>`). Your function should take as input the name of the FASTA file and return two strings. First, it should return the descriptor string (which starts with `>`). Second, it should return a string with no gaps containing the sequence.\n",
"\n",
"Test your function on the _Salmonella_ sequence."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solution\n",
"\n",
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a)** A quick look using `less` indicates that this is a valid FASTA file. I did a quick check to see how many sequences there are by searching for the `>` symbol using `grep`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000\n"
]
}
],
"source": [
"!grep \">\" data/salmonella_spi1_region.fna"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can do the same for the λ DNA file we will use in [Exercise 2.2](exercise_2.2.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">Lambda_NEB\n"
]
}
],
"source": [
"!grep \">\" data/lambdaDNA.fasta"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that in both there is a single record. So, when we read it in, we just need to skip the first line and read on until we get the full record."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b)** We will read in the files line by line, saving the first line as the descriptor, and then building the sequence as we go along."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def read_fasta(filename):\n",
" \"\"\"Read a sequence in from a FASTA file containing a single sequence.\n",
" \n",
" We assume that the first line of the file is the descriptor and all\n",
" subsequent lines are sequence. \n",
" \"\"\"\n",
" with open(filename, \"r\") as f:\n",
" # Read in descriptor\n",
" descriptor = f.readline().rstrip()\n",
"\n",
" # Read in sequence, stripping the whitespace from each line\n",
" seq = \"\"\n",
" line = f.readline().rstrip()\n",
" while line != \"\":\n",
" seq += line\n",
" line = f.readline().rstrip()\n",
"\n",
" return descriptor, seq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, when writing this function, we should be taking a test-driven development approach, but here we are focusing on our file I/O and string handling skills.\n",
"\n",
"Let's take the function for a drive!"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTCACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCACCCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGTCAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGGCATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGTCATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTTTATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGCTTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAGAGCTGCCGCATCCGTACCAC'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"descriptor, seq = read_fasta(\"data/salmonella_spi1_region.fna\")\n",
"\n",
"# Take a look at the first 500 bases to make sure we got it right.\n",
"seq[:500]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks good!\n",
"\n",
"And for the λ-DNA...."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCGTACCATTCAGAACTGGCAGGAACAGGGAATGCCCGTTCTGCGAGGCGGTGGCAAGGGTAATGAGGTGCTTTATGACTCTGCCGCCGTCATAAAATGGTATGCCGAAAGGGATGCTGAAATTGAGAACGAAAAGCTGCGCCGGGAGGTTGAAGAACTGCGGCAGGCCAGCGAGGCAGATCTCCAGCCAGGAACTATTGAGTACGAACGCCATCGACTTACGCGTGCGCAGGCCGACGCACAGGAACTGAAGAATGCCAGAG'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"descriptor_lambda, seq_lambda = read_fasta(\"data/lambdaDNA.fasta\")\n",
"\n",
"# Take a look at the first 500 bases to make sure we got it right.\n",
"seq_lambda[:500]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing environment"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python implementation: CPython\n",
"Python version : 3.11.3\n",
"IPython version : 8.12.0\n",
"\n",
"jupyterlab: 3.6.3\n",
"\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -v -p jupyterlab"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Loading

0 comments on commit c0fa045

Please sign in to comment.