Added exercise 2 solutions.

justinbois · Jun 20, 2024 · c0fa045 · c0fa045
1 parent c39ca20
commit c0fa045
Show file tree

Hide file tree

Showing 123 changed files with 22,266 additions and 4,689 deletions.
diff --git a/2024/_sources/exercise_solutions/exercise_2/exercise_2.1_solution.ipynb.txt b/2024/_sources/exercise_solutions/exercise_2/exercise_2.1_solution.ipynb.txt
@@ -0,0 +1,262 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exercise 2.1: Parsing a FASTA file\n",
+    "\n",
+    "<hr>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are packages, like [Biopython](http://biopython.org/) and [scikit-bio](http://scikit-bio.org) for processing files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills.  \n",
+    "\n",
+    "**a)** Use command line tools to investigate the [FASTA file](https://en.wikipedia.org/wiki/FASTA_format) located at `~/git/bootcamp/data/salmonella_spi1_region.fna`. This file contains a portion of the _Salmonella_ genome (described in [Exercise 4.1](exercise_2.3.ipynb)).\n",
+    "\n",
+    "You will notice that the first line begins with a `>`, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.\n",
+    "\n",
+    "**b)** The format of the _Salmonella_ SPI1 region FASTA file is a common format for such files (though oftentimes FASTA files contain multiple sequences). Use the file I/O skills you have learned to write a function to read in a sequence from a FASTA file containing a single sequence (but possibly having the first line in the file beginning with `>`). Your function should take as input the name of the FASTA file and return two strings. First, it should return the descriptor string (which starts with `>`). Second, it should return a string with no gaps containing the sequence.\n",
+    "\n",
+    "Test your function on the _Salmonella_ sequence."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<br />"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Solution\n",
+    "\n",
+    "<hr>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**a)** A quick look using `less` indicates that this is a valid FASTA file. I did a quick check to see how many sequences there are by searching for the `>` symbol using `grep`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000\n"
+     ]
+    }
+   ],
+   "source": [
+    "!grep \">\" data/salmonella_spi1_region.fna"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can do the same for the λ DNA file we will use in [Exercise 2.2](exercise_2.2.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      ">Lambda_NEB\n"
+     ]
+    }
+   ],
+   "source": [
+    "!grep \">\" data/lambdaDNA.fasta"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We see that in both there is a single record. So, when we read it in, we just need to skip the first line and read on until we get the full record."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**b)** We will read in the files line by line, saving the first line as the descriptor, and then building the sequence as we go along."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "def read_fasta(filename):\n",
+    "    \"\"\"Read a sequence in from a FASTA file containing a single sequence.\n",
+    "    \n",
+    "    We assume that the first line of the file is the descriptor and all\n",
+    "    subsequent lines are sequence.    \n",
+    "    \"\"\"\n",
+    "    with open(filename, \"r\") as f:\n",
+    "        # Read in descriptor\n",
+    "        descriptor = f.readline().rstrip()\n",
+    "\n",
+    "        # Read in sequence, stripping the whitespace from each line\n",
+    "        seq = \"\"\n",
+    "        line = f.readline().rstrip()\n",
+    "        while line != \"\":\n",
+    "            seq += line\n",
+    "            line = f.readline().rstrip()\n",
+    "\n",
+    "    return descriptor, seq"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Of course, when writing this function, we should be taking a test-driven development approach, but here we are focusing on our file I/O and string handling skills.\n",
+    "\n",
+    "Let's take the function for a drive!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTCACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCACCCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGTCAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGGCATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGTCATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTTTATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGCTTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAGAGCTGCCGCATCCGTACCAC'"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "descriptor, seq = read_fasta(\"data/salmonella_spi1_region.fna\")\n",
+    "\n",
+    "# Take a look at the first 500 bases to make sure we got it right.\n",
+    "seq[:500]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Looks good!\n",
+    "\n",
+    "And for the λ-DNA...."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCGTACCATTCAGAACTGGCAGGAACAGGGAATGCCCGTTCTGCGAGGCGGTGGCAAGGGTAATGAGGTGCTTTATGACTCTGCCGCCGTCATAAAATGGTATGCCGAAAGGGATGCTGAAATTGAGAACGAAAAGCTGCGCCGGGAGGTTGAAGAACTGCGGCAGGCCAGCGAGGCAGATCTCCAGCCAGGAACTATTGAGTACGAACGCCATCGACTTACGCGTGCGCAGGCCGACGCACAGGAACTGAAGAATGCCAGAG'"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "descriptor_lambda, seq_lambda = read_fasta(\"data/lambdaDNA.fasta\")\n",
+    "\n",
+    "# Take a look at the first 500 bases to make sure we got it right.\n",
+    "seq_lambda[:500]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Computing environment"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "tags": [
+     "hide-input"
+    ]
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Python implementation: CPython\n",
+      "Python version       : 3.11.3\n",
+      "IPython version      : 8.12.0\n",
+      "\n",
+      "jupyterlab: 3.6.3\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "%load_ext watermark\n",
+    "%watermark -v -p jupyterlab"
+   ]
+  }
+ ],
+ "metadata": {
+  "anaconda-cloud": {},
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}