106 changes: 6 additions & 100 deletions README.md
@@ -1,102 +1,8 @@
# Classify code snippets into programming languages
All the required files are in the `plc` directory. To have the classifier guess which language a code snippet is written in:
1. Copy your code snippet into `code_snippet.txt`, or create a `.txt` file with a different name that contains the snippet.
2. Run `python guess_lang.py code_snippet.txt` from the command line. (If you named your file something other than `code_snippet.txt`, use that name instead.)

The program prints the language it thinks the snippet is written in, along with a percentage indicating how confident it is in that guess.

## Description

Create a classifier that can take snippets of code and guess the programming language of the code.

## Objectives

### Learning Objectives

After completing this assignment, you should understand:

* Feature extraction
* Classification
* The varied syntax of programming languages

### Performance Objectives

After completing this assignment, you should be able to:

* Build a robust classifier

## Details

### Deliverables

* A Git repo called programming-language-classifier containing at least:
  * `README.md` file explaining how to run your project
  * a `requirements.txt` file
  * a suite of tests for your project

### Requirements

* Passing unit tests
* No PEP8 or Pyflakes warnings or errors

## Normal Mode

### Getting a corpus of programming languages

Option 1: Get code from the [Computer Language Benchmarks Game](http://benchmarksgame.alioth.debian.org/). You can [download their code](https://alioth.debian.org/snapshots.php?group_id=100815) directly. In the downloaded archive under `benchmarksgame/bench`, you'll find many directories with short programs in them. Using the file extensions of these files, you should be able to find out what programming language they are.
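
As a rough illustration of Option 1, the sketch below walks a directory tree and labels each file by its extension. The `EXTENSION_TO_LANGUAGE` map and the `load_corpus` name are assumptions made for this example, and the extensions listed are only the conventional ones; the downloaded archive may use different suffixes, so inspect it and adjust the map.

```python
from pathlib import Path

# Hypothetical extension-to-language map (an assumption, not taken from the
# archive itself); extend or correct it to match the files you actually have.
EXTENSION_TO_LANGUAGE = {
    ".clj": "Clojure",
    ".hs": "Haskell",
    ".java": "Java",
    ".js": "JavaScript",
    ".ml": "OCaml",
    ".pl": "Perl",
    ".php": "PHP",
    ".py": "Python",
    ".rb": "Ruby",
    ".scala": "Scala",
    ".scm": "Scheme",
    ".tcl": "Tcl",
}


def load_corpus(root):
    """Return a list of (source_text, language) pairs found under root."""
    samples = []
    for path in sorted(Path(root).rglob("*")):
        language = EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
        # Skip directories and files whose extension is not in the map.
        if language is None or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        samples.append((text, language))
    return samples
```

Calling `load_corpus("benchmarksgame/bench")` would then give you labeled training samples, provided the map covers the extensions present there.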

Option 2: Scrape code from [Rosetta Code](http://rosettacode.org/wiki/Rosetta_Code). You will need to figure out how to scrape HTML and parse it. [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is your best bet for doing that.

Option 3: Get code from GitHub somehow. The specifics of this are left up to you.

You are allowed to use other code samples as well.

**For your sanity, you only have to worry about the following languages:**

* Clojure
* Haskell
* Java
* JavaScript
* OCaml
* Perl
* PHP
* Python
* Ruby
* Scala
* Scheme
* Tcl

Feel more than free to add others!

### Classifying new snippets

Using your corpus, extract features for your classifier. Use whichever classifier engine works best for you, _as long as you can explain how it works._

Your initial classifier should be able to take a string containing code and return a guessed language for it. It is recommended you also have a method that returns the snippet's percentage chance for each language in a dict.
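
One way to meet both requirements is sketched below: a small multinomial naive Bayes classifier over token counts, written with the standard library only. This is an illustrative choice, not the classifier this project uses; the class name, the method names `guess` and `probabilities`, and the tokenizing regular expression are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Word-like tokens, plus each punctuation character as its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def extract_features(snippet):
    """Count word-like tokens and individual punctuation characters."""
    return Counter(TOKEN_PATTERN.findall(snippet))


class NaiveBayesLanguageClassifier:
    """Hypothetical minimal classifier: multinomial naive Bayes over tokens."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.snippet_counts = Counter()           # language -> snippets seen
        self.vocabulary = set()

    def fit(self, samples):
        """Train on an iterable of (snippet, language) pairs."""
        for snippet, language in samples:
            features = extract_features(snippet)
            self.token_counts[language].update(features)
            self.snippet_counts[language] += 1
            self.vocabulary.update(features)
        return self

    def _log_scores(self, snippet):
        features = extract_features(snippet)
        total_snippets = sum(self.snippet_counts.values())
        scores = {}
        for language, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Start from the log prior of the language.
            score = math.log(self.snippet_counts[language] / total_snippets)
            for token, n in features.items():
                # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
                p = (counts[token] + 1) / (total_tokens + len(self.vocabulary) + 1)
                score += n * math.log(p)
            scores[language] = score
        return scores

    def probabilities(self, snippet):
        """Return a dict mapping each language to its normalized probability."""
        scores = self._log_scores(snippet)
        peak = max(scores.values())
        # Subtract the peak before exponentiating to avoid underflow.
        weights = {lang: math.exp(s - peak) for lang, s in scores.items()}
        total = sum(weights.values())
        return {lang: w / total for lang, w in weights.items()}

    def guess(self, snippet):
        """Return the single most likely language for the snippet."""
        probs = self.probabilities(snippet)
        return max(probs, key=probs.get)
```

After `clf = NaiveBayesLanguageClassifier().fit(samples)`, `clf.guess(code)` returns one language and `clf.probabilities(code)` returns the per-language dict. Multiply the values by 100 if you want percentages.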

### Testing your classifier

The `test/` directory contains code snippets. The file `test.csv` contains a list of the file names in the `test` directory and the language of each snippet. Use this set of snippets to test your classifier. _Do not use the test snippets for training your classifier, obviously._
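
A simple accuracy check over the test set might look like the sketch below. It assumes each row of `test.csv` is `filename,language` with no header row, which may not match the real file; the `evaluate` function and its `classify` callable parameter are hypothetical names for this example.

```python
import csv
from pathlib import Path


def evaluate(classify, test_dir, csv_path):
    """Return the fraction of listed snippets that classify() labels correctly.

    classify is any callable that takes a snippet string and returns a
    language name. Rows are assumed to be "filename,language".
    """
    correct = 0
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            filename, expected = row[0].strip(), row[1].strip()
            snippet = (Path(test_dir) / filename).read_text(
                encoding="utf-8", errors="replace"
            )
            # Compare case-insensitively so "python" matches "Python".
            if classify(snippet).lower() == expected.lower():
                correct += 1
            total += 1
    return correct / total if total else 0.0
```

If `test.csv` does start with a header row, skip it before the loop, for example with `next(reader, None)` on a named reader object.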

### Code layout

This project should be laid out in accordance with the project layout from _The Hacker's Guide to Python_. It should have tests for things which can be tested. Your classifier should be able to be run with a small controlled corpus for testing.

Your project should also contain an IPython notebook that demonstrates use of your classifier.

## Hard Mode

In addition to the requirements from **Normal Mode**:

Create a runnable Python file that can classify a snippet in a text file, run like this:

`guess_lang.py code-snippet.txt`

where `guess_lang.py` is whatever you name your program and `code-snippet.txt` is any snippet. Your program should print out the language it thinks the snippet is.

To do this, you will likely want to either pre-parse your corpus and output it as features to load or save out your classifier for later use. Otherwise, you'll have to read your entire corpus every time you run the program. That's acceptable, but slow.
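
Saving the trained classifier can be as small as the sketch below, which uses the standard `pickle` module. The function names are hypothetical, and this assumes your classifier object is picklable, which is true for plain Python objects and most library estimators.

```python
import pickle


def save_classifier(classifier, path):
    """Serialize a trained classifier to disk so training runs only once."""
    with open(path, "wb") as handle:
        pickle.dump(classifier, handle)


def load_classifier(path):
    """Load a classifier previously written by save_classifier().

    Only unpickle files you created yourself: loading a pickle can
    execute arbitrary code.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

With this in place, the command-line program can call `load_classifier` at startup instead of rereading the whole corpus.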

You may want to add some command-line flags to your program. You could allow people to choose the corpus, for example, or to get percentage chances instead of one language. To understand how to write a command-line program with arguments and flags, see the [argparse](https://docs.python.org/3/library/argparse.html) module in the standard library.
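
The two flags suggested above could be declared roughly as follows. The flag names `--corpus` and `--probabilities` and the default corpus path are assumptions chosen for this sketch, not options the existing `guess_lang.py` is known to support.

```python
import argparse


def build_parser():
    """Build a command-line parser for a hypothetical guess_lang.py."""
    parser = argparse.ArgumentParser(
        prog="guess_lang.py",
        description="Guess the programming language of a code snippet.",
    )
    parser.add_argument(
        "snippet",
        help="path to a text file containing the code snippet",
    )
    parser.add_argument(
        "--corpus",
        default="corpus",  # assumed default location of the training corpus
        help="directory containing the training corpus",
    )
    parser.add_argument(
        "--probabilities",
        action="store_true",
        help="print the percentage chance for every language",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.snippet, args.corpus, args.probabilities)
```

Running `python guess_lang.py code-snippet.txt --probabilities` would then set `args.probabilities` to `True`, and `argparse` generates `--help` output automatically.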

## Additional Resources

* [TextBlob](http://textblob.readthedocs.org/en/dev/)
* [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)
* [Rosetta Code](http://rosettacode.org/wiki/Rosetta_Code)
* [Working with Text Files](https://opentechschool.github.io/python-data-intro/core/text-files.html)

The IPython notebook "Code Classifier.ipynb" demonstrates how the classifier is built and runs it against the test snippets in the "test" directory. Supplemental functions live in the modules imported at the start of the notebook.
A second IPython notebook, "split and test.ipynb", contains a classifier trained on a portion of the Benchmarks Game code and tested on the remainder.