Commit 221903c

Text classification examples
1 parent 50af454 commit 221903c

9 files changed, +140 -8 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -10,3 +10,5 @@ humbuglog.*
 /data/stackoverflow_*.csv
 /data/crimes*.csv
 /data/bbc
+/model/*
+!/model/.gitkeep

README.md

Lines changed: 9 additions & 0 deletions
@@ -8,6 +8,14 @@ Interesting demo/examples projects using `php-ml`:

 * [Code Review Estimator](https://github.com/akondas/code-review-estimator) - Simple showcase of machine learning for code review cost estimation.

+## Articles
+
+Many of the samples from this repository were used in my articles:
+
+* Text data classification with BBC news article dataset
+* [Clustering Chicago robberies locations with k-means algorithm](https://arkadiuszkondas.com/clustering-chicago-robberies-locations-with-k-means-algorithm/)
+* [Predict air pollution with k-Nearest Neighbors and PHP](https://arkadiuszkondas.com/predict-air-pollution-with-k-nearest-neighbors-and-php/)
+
 ## Examples

 To test an example, select one of the following and run it from the main folder (each category has its own folder).
@@ -21,6 +29,7 @@ Classification:
 * `languageDetection.php` - classifier built for language detection
 * `minst.php` - recognize handwritten digits from MNIST dataset (to download dataset use `bin/download-mnist.sh`)
 * `spamFilter.php` - simple spam filter with example dataset
+* `bbc.php` - example of text classification

 Regression:

classification/bbc.php

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+<?php
+
+declare(strict_types=1);
+
+namespace PhpmlExamples;
+
+use Phpml\Classification\SVC;
+use Phpml\CrossValidation\StratifiedRandomSplit;
+use Phpml\Dataset\FilesDataset;
+use Phpml\FeatureExtraction\StopWords\English;
+use Phpml\FeatureExtraction\TfIdfTransformer;
+use Phpml\FeatureExtraction\TokenCountVectorizer;
+use Phpml\Metric\Accuracy;
+use Phpml\Tokenization\NGramTokenizer;
+
+include 'vendor/autoload.php';
+
+$dataset = new FilesDataset(__DIR__.'/../data/bbc');
+$split = new StratifiedRandomSplit($dataset, 0.3);
+
+$samples = $split->getTrainSamples();
+
+$vectorizer = new TokenCountVectorizer(new NGramTokenizer(1, 3), new English());
+$vectorizer->fit($samples);
+$vectorizer->transform($samples);
+
+$transformer = new TfIdfTransformer();
+$transformer->fit($samples);
+$transformer->transform($samples);
+
+$classifier = new SVC();
+$classifier->train($samples, $split->getTrainLabels());
+
+
+$testSamples = $split->getTestSamples();
+$vectorizer->transform($testSamples);
+$transformer->transform($testSamples);
+
+$predicted = $classifier->predict($testSamples);
+
+echo 'Accuracy: ' . Accuracy::score($split->getTestLabels(), $predicted);
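
The TF-IDF step in `bbc.php` re-weights the token counts produced by the vectorizer. As a rough illustration only, the plain-PHP sketch below applies one common weighting, idf = log(N / df), to a small count matrix. The `tfIdf` function is a hypothetical helper written for this note, not php-ml's `TfIdfTransformer`, whose exact formula may differ.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of php-ml): multiply each token count by
 * idf = log(N / df), where N is the number of documents and df is the
 * number of documents in which the token occurs at least once.
 *
 * @param array<int, array<int, int>> $counts one row per document, one column per token
 *
 * @return array<int, array<int, float>>
 */
function tfIdf(array $counts): array
{
    $documentCount = count($counts);

    // Count in how many documents each token appears.
    $documentFrequency = [];
    foreach ($counts as $row) {
        foreach ($row as $token => $count) {
            if ($count > 0) {
                $documentFrequency[$token] = ($documentFrequency[$token] ?? 0) + 1;
            }
        }
    }

    // Scale every count by the idf of its token.
    $weighted = [];
    foreach ($counts as $i => $row) {
        $weighted[$i] = [];
        foreach ($row as $token => $count) {
            $idf = isset($documentFrequency[$token])
                ? log($documentCount / $documentFrequency[$token])
                : 0.0;
            $weighted[$i][$token] = $count * $idf;
        }
    }

    return $weighted;
}

// Two documents, three tokens. Token 0 occurs in both documents,
// so its idf is log(2 / 2) = 0 and it carries no weight.
$weighted = tfIdf([
    [1, 1, 0],
    [1, 0, 2],
]);
```

A token that appears in every document ends up with weight zero under this variant, which is why very frequent terms contribute little to the classifier's input.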

classification/bbcPipeline.php

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+<?php
+
+declare(strict_types=1);
+
+namespace PhpmlExamples;
+
+use Phpml\Classification\SVC;
+use Phpml\CrossValidation\StratifiedRandomSplit;
+use Phpml\Dataset\FilesDataset;
+use Phpml\FeatureExtraction\StopWords\English;
+use Phpml\FeatureExtraction\TfIdfTransformer;
+use Phpml\FeatureExtraction\TokenCountVectorizer;
+use Phpml\Metric\Accuracy;
+use Phpml\ModelManager;
+use Phpml\Pipeline;
+use Phpml\SupportVectorMachine\Kernel;
+use Phpml\Tokenization\NGramTokenizer;
+
+include 'vendor/autoload.php';
+
+$dataset = new FilesDataset(__DIR__.'/../data/bbc');
+$split = new StratifiedRandomSplit($dataset, 0.1);
+
+
+$pipeline = new Pipeline([
+    new TokenCountVectorizer($tokenizer = new NGramTokenizer(1, 3), new English()),
+    new TfIdfTransformer()
+], new SVC(Kernel::LINEAR));
+
+$start = microtime(true);
+$pipeline->train($split->getTrainSamples(), $split->getTrainLabels());
+$stop = microtime(true);
+
+$predicted = $pipeline->predict($split->getTestSamples());
+
+echo 'Train: ' . round($stop - $start, 4) . 's'. PHP_EOL;
+echo 'Estimator: ' . get_class($pipeline->getEstimator()) . PHP_EOL;
+echo 'Tokenizer: ' . get_class($tokenizer) . PHP_EOL;
+echo 'Accuracy: ' . Accuracy::score($split->getTestLabels(), $predicted);
+
+$modelManager = new ModelManager();
+$modelManager->saveToFile($pipeline, __DIR__.'/../model/bbc.phpml');
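
Both scripts hold out part of the BBC dataset with `StratifiedRandomSplit` (30% in `bbc.php`, 10% here). The sketch below shows the general idea of a stratified split in plain PHP: shuffle the samples of each label separately and move the same fraction of each into the test set. `stratifiedSplit` and its `$seed` parameter are hypothetical names introduced for illustration, and the rounding rule is an assumption rather than php-ml's documented behaviour.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of php-ml): split samples so that every
 * label contributes roughly the same fraction to the test set.
 *
 * @return array{0: array, 1: array, 2: array, 3: array}
 *         [train samples, test samples, train labels, test labels]
 */
function stratifiedSplit(array $samples, array $labels, float $testSize, int $seed): array
{
    // Seed the generator used by shuffle() so the split is repeatable.
    mt_srand($seed);

    // Group sample indices by their label.
    $byLabel = [];
    foreach ($labels as $index => $label) {
        $byLabel[$label][] = $index;
    }

    $trainSamples = $testSamples = $trainLabels = $testLabels = [];
    foreach ($byLabel as $indices) {
        shuffle($indices);
        // Assumption: round the per-label test count to the nearest integer.
        $testCount = (int) round(count($indices) * $testSize);

        foreach ($indices as $position => $index) {
            if ($position < $testCount) {
                $testSamples[] = $samples[$index];
                $testLabels[] = $labels[$index];
            } else {
                $trainSamples[] = $samples[$index];
                $trainLabels[] = $labels[$index];
            }
        }
    }

    return [$trainSamples, $testSamples, $trainLabels, $testLabels];
}

// 10 "business" and 20 "sport" samples with a 10% test share.
$labels = array_merge(array_fill(0, 10, 'business'), array_fill(0, 20, 'sport'));
$samples = array_map(static function (int $i): string {
    return 'document ' . $i;
}, array_keys($labels));

[$trainSamples, $testSamples, $trainLabels, $testLabels] = stratifiedSplit($samples, $labels, 0.1, 42);
$testDistribution = array_count_values($testLabels);
```

With these inputs the test set keeps the 1:2 ratio of the full dataset, which a purely random split would not guarantee on a small sample.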

classification/bbcRestored.php

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+<?php
+
+declare(strict_types=1);
+
+namespace PhpmlExamples;
+
+use Phpml\ModelManager;
+
+include 'vendor/autoload.php';
+
+$start = microtime(true);
+$modelManager = new ModelManager();
+$model = $modelManager->restoreFromFile(__DIR__.'/../model/bbc.phpml');
+$total = microtime(true) - $start;
+
+echo sprintf('Model loaded in %ss', round($total, 4)) . PHP_EOL;
+
+$text = 'The future of the games industry, at least as Google sees it, is in streaming.
+It’s a trend that feels inevitable - just ask anyone in the music, TV or film business. Streaming is where it\'s at, and the possibility for what can be streamed has only ever been bound by the limitations of internet connectivity.
+Google thinks its technology can make streaming games a plausible and possibly even pleasurable reality. One where gamers aren’t driven to insanity by stuttering gameplay and slow-reacting characters.
+For the sake of argument, let’s assume it succeeds. Where might Google - with its track record for upending business models, often with unintended consequences - lead the industry?
+Shifting costs
+Games consoles are expensive. The games are (mostly) expensive.
+Google’s Stadia could eliminate both costs, replacing them with a subscription fee. A ballpark figure might be $15-$30 a month - though some predict big name titles might have an additional fee on top, like buying a new movie on Amazon Prime Video.
+Good news? It depends on where you’re coming from.
+For gamers, there are a number of hurdles. Phil Harrison, Google’s man in charge of Stadia, told me his team\'s tests managed 4K gaming on download speeds of “around 25mbps”.
+For context, Microsoft currently suggests a minimum of just 3mbps to play “traditional” games online. And the difference between getting 3mbps and 25mbps? Hundreds of dollars a year in payments to your internet service provider.
+Or, the difference could be not being able to play at all - 25mbps is more than double the average connection speed across the US, according to research commissioned and part-funded by, er, Google.';
+
+
+$start = microtime(true);
+
+$predicted = $model->predict([$text])[0];
+$total = microtime(true) - $start;
+
+echo sprintf('Predicted category: %s in %ss', $predicted, round($total, 6)) . PHP_EOL;
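
The save and restore round trip used by `bbcPipeline.php` and `bbcRestored.php` can be pictured with PHP's native serialization. This commit does not show how `ModelManager` stores the model on disk, so the sketch below only assumes a `serialize`/`unserialize` approach for illustration. `KeywordModel` is a hypothetical stand-in for a trained pipeline, not a php-ml class.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for a trained model: maps a keyword to a category.
 */
final class KeywordModel
{
    /** @var array<string, string> keyword => category */
    private $keywords;

    public function __construct(array $keywords)
    {
        $this->keywords = $keywords;
    }

    /**
     * @param string[] $texts
     *
     * @return string[] one category per input text
     */
    public function predict(array $texts): array
    {
        $predictions = [];
        foreach ($texts as $text) {
            $category = 'unknown';
            foreach ($this->keywords as $keyword => $label) {
                if (stripos($text, (string) $keyword) !== false) {
                    $category = $label;
                    break;
                }
            }
            $predictions[] = $category;
        }

        return $predictions;
    }
}

// Save: write the serialized object to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'model');
file_put_contents($path, serialize(new KeywordModel(['streaming' => 'tech', 'goal' => 'sport'])));

// Restore: read the file back and rebuild the object, then clean up.
$restored = unserialize(file_get_contents($path));
unlink($path);

$category = $restored->predict(['The future of the games industry is in streaming.'])[0];
```

The restored object answers `predict()` without retraining, which is the point of persisting the pipeline: training cost is paid once, and later scripts only pay the load time.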

composer.json

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
         }
     ],
     "require": {
-        "php-ai/php-ml": "0.7.0"
+        "php-ai/php-ml": "0.8.0"
     },
     "require-dev": {
         "friendsofphp/php-cs-fixer": "^2.14"

composer.lock

Lines changed: 6 additions & 6 deletions
Some generated files are not rendered by default.

model/.gitkeep

Whitespace-only changes.

regression/wineQuality.php

Lines changed: 3 additions & 1 deletion
@@ -1,4 +1,6 @@
-<?php declare(strict_types=1);
+<?php
+
+declare(strict_types=1);

 namespace PhpmlExamples;
