diff --git a/assignment-2018-09-21.ipynb b/assignment-2018-09-21.ipynb
deleted file mode 100644
index 9525581..0000000
--- a/assignment-2018-09-21.ipynb
+++ /dev/null
@@ -1,281 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Assignment 1: Analyzing Stack Overflow Data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Introduction "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In this assignment, we will look at [Stack Overflow](https://stackoverflow.com/) post data from the year of 2015 and measure the similarity of users by looking at the types of questions they answer. Do not delete the output of your code cells. This assignment must be completed **individually** by each student."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Submission "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Please use the following invitation link to create your assignment repository for this assignment: [https://classroom.github.com/a/epLjOcUA](https://classroom.github.com/a/epLjOcUA). Include your BU username within your submission by adding it here: **\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import requests\n",
- "from datetime import datetime\n",
- "\n",
- "start_time = 1420070400 # 01-01-2015 at 00:00:00\n",
- "end_time = 1420156800 # 01-02-2015 at 00:00:00\n",
- "\n",
- "response = requests.get(\"https://api.stackexchange.com/2.2/questions?pagesize=100\" +\n",
- " \"&fromdate=\" + str(start_time) + \"&todate=\" + str(end_time) +\n",
- " \"&order=asc&sort=creation&site=stackoverflow\")\n",
- "\n",
- "print(response) # Displays the HTTP response code (should be 200 for success)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "All dates in the Stack Exchange API are in [unix epoch time](https://en.wikipedia.org/wiki/Unix_time). The format for the request string is specified [here](https://api.stackexchange.com/docs/questions).\n",
- "\n",
- "We can try to print the response that Stack Exchange returns."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(response.text)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The raw response is not very useful for automated processing. Instead, we can decode the raw response as JSON and then use the `json` library to print it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "import json\n",
- "\n",
- "json_response = response.json()\n",
- "\n",
- "print(json.dumps(json_response, indent=2))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "It is now possible to see that the response consists of a list of question items. For each of these items, we get information about its attributes: `creation_date`, `answer_count`, `owner`, `title`, and so on.\n",
- "\n",
- "Notice that `has_more` is `true`. To get more items, we can [request the next page](https://api.stackexchange.com/docs/paging)."
- ]
- },
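- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For instance, a minimal sketch of fetching the next page of the same query (the `page` parameter is 1-based, so the request above returned page 1):\n",
- "```\n",
- "response_page2 = requests.get(\"https://api.stackexchange.com/2.2/questions?pagesize=100\" +\n",
- "                              \"&fromdate=\" + str(start_time) + \"&todate=\" + str(end_time) +\n",
- "                              \"&order=asc&sort=creation&site=stackoverflow&page=2\")\n",
- "```"
- ]
- },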
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Problem 1: Parsing the responses"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In this problem you will practice using some of the techniques and string handling methods that Python offers. Our goal is to extract the interesting parts of the response data and transform them into a format that will be useful for our final analysis."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Part A (8 points):** We will first isolate the `creation_date` attribute in the response. Complete the definition of the ```print_creation_dates_json()``` below, which reads the response and prints the creation dates."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
- "def print_creation_dates_json(response):\n",
- " \"\"\"\n",
- " Prints the creation_date of all the questions in the response.\n",
- " \n",
- " Parameters:\n",
- " response: Response object\n",
- " \"\"\"\n",
- " "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Part B (8 points):** Write code that issues requests to retrieve all of the creation dates of questions posted on the first day in 2015. The code should call the ```print_creation_dates_json()``` function to print out each page of results.\n",
- "\n",
- "**Hint:** You can use a loop and take advantage of the `has_more` attribute to request additional pages of results if they exist. Please be aware of Stack Exchange's [rate limit](https://api.stackexchange.com/docs/throttle); you can use the Python `sleep` function in the `time` module ([documentation can be found here](https://docs.python.org/3/library/time.html#time.sleep)) to add a delay between requests."
- ]
- },
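- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "One possible shape for this loop (not a complete solution; `get_page()` is a hypothetical helper that issues the request above with an added `page` parameter):\n",
- "```\n",
- "import time\n",
- "\n",
- "page = 1\n",
- "while True:\n",
- "    response = get_page(page)  # hypothetical helper wrapping requests.get\n",
- "    print_creation_dates_json(response)\n",
- "    if not response.json().get(\"has_more\"):\n",
- "        break\n",
- "    page += 1\n",
- "    time.sleep(1)  # stay well under the rate limit\n",
- "```"
- ]
- },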
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Due to time constraints, we have already downloaded the [data dump](http://cs-people.bu.edu/lapets/506/data/stackoverflow-posts-2015.tar.gz) for Stack Overflow's posts in 2015. Note that the XML file is 10GB in size unzipped. If you don't have space on your computer, you can download it into `/scratch` on one of the machines in the undergrad lab, or you can download it onto a USB drive. You may want to work with a subset of this data at first, but your solution should be efficient enough to work with the whole dataset. For example, if you call `read()` on the whole dataset, you will get a `MemoryError`.\n",
- "\n",
- "Do not commit the data file to your repository. You may assume that we will place the data file in the same directory as your submitted notebook file, so use a relative path when loading the data file."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Part C (16 points):** Write a function that parses out the questions posted in 2015. These are posts with `PostTypeId=1`. The function should return a pandas `DataFrame` object with 4 columns: `Id`, `CreationDate`, `OwnerUserId`, and the first tag in `Tags`. Call the function on an appropriate input and print out the `DataFrame` object; do not clear the output.\n",
- "\n",
- "**Hint:** You should be able to use `iterparse` ([documentation can be found here](https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse)):\n",
- "```\n",
- "from lxml.etree import iterparse\n",
- "```\n",
- "Once you create the `iterparse` object (let's call it `parsed`) for the file, you can use a `for` loop such as:\n",
- "```\n",
- "for _, element in parsed:\n",
- " # ...\n",
- "```\n",
- "You can use the `.tag` and `.get()` methods of the `element` object to inspect it and extract data from it."
- ]
- },
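- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A sketch of this loop, assuming the unpacked file is named `stackoverflow-posts-2015.xml` and each post is a `row` element (as in the Stack Exchange data dumps); clearing elements after processing them keeps memory usage bounded:\n",
- "```\n",
- "from lxml.etree import iterparse\n",
- "\n",
- "parsed = iterparse(\"stackoverflow-posts-2015.xml\")  # assumed filename\n",
- "for _, element in parsed:\n",
- "    if element.tag == \"row\" and element.get(\"PostTypeId\") == \"1\":\n",
- "        pass  # collect element.get(\"Id\"), element.get(\"CreationDate\"), ...\n",
- "    # free the memory held by elements we have already processed\n",
- "    element.clear()\n",
- "    while element.getprevious() is not None:\n",
- "        del element.getparent()[0]\n",
- "```"
- ]
- },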
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Problem 2: Analyzing the responses"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Part A (50 points):** Write a function that measures the similarity of the top 100 users with the most answer posts. Compare the users based on the types of questions they answer. We will categorize a question by its first tag. You may choose to implement any one of the similarity/distance measures we discussed in class.\n",
- "\n",
- "Note that answers are posts with `PostTypeId=2`. The identifier of the question in answer posts is the `ParentId`.\n",
- "\n",
- "You may find the [sklearn.feature_extraction](http://scikit-learn.org/stable/modules/feature_extraction.html) module helpful.\n",
- "\n",
- "**Hint:** You may want to begin your solution by constructing a data set in the following way:\n",
- "1. Find users with the most responses using `.groupby()` on a data frame containing all of the entries in the answers data set.\n",
- "2. Sort to find the top 100 users from the result above.\n",
- "3. Find all the question identifiers (using `ParentId`) among the answers of the top 100 users (within the answers data set).\n",
- "4. Join/merge the result above with the questions data set to get the tags of the questions answered by the top 100 users."
- ]
- },
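- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A minimal sketch of steps 1-4 plus one similarity computation (cosine similarity shown; substitute the measure you choose). It assumes `answers` and `questions` are data frames built as in Problem 1, with the first tag stored in a `Tag` column:\n",
- "```\n",
- "import pandas as pd\n",
- "from sklearn.metrics.pairwise import cosine_similarity\n",
- "\n",
- "# steps 1-2: the 100 users with the most answer posts\n",
- "top = answers.groupby(\"OwnerUserId\").size().nlargest(100).index\n",
- "\n",
- "# steps 3-4: join their answers to the questions to recover each first tag\n",
- "merged = answers[answers[\"OwnerUserId\"].isin(top)].merge(\n",
- "    questions, left_on=\"ParentId\", right_on=\"Id\", suffixes=(\"\", \"_q\"))\n",
- "\n",
- "# user-by-tag count matrix, then a 100 x 100 similarity matrix\n",
- "counts = pd.crosstab(merged[\"OwnerUserId\"], merged[\"Tag\"])\n",
- "similarity = cosine_similarity(counts)\n",
- "```"
- ]
- },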
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Part B (18 points):** Plot the distance of the top 100 users using a [heatmap](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.heatmap.html)."
- ]
- },
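- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A minimal plotting sketch, assuming `similarity` is the matrix from Part A (for cosine similarity, distance is `1 - similarity`):\n",
- "```\n",
- "import seaborn as sns\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "plt.figure(figsize=(10, 8))\n",
- "sns.heatmap(1 - similarity)  # pairwise distance between the top 100 users\n",
- "plt.show()\n",
- "```"
- ]
- },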
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "anaconda-cloud": {},
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.4"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 1
-}