Various tools for helping to improve media search on Wikimedia Commons by using labeled data
We need a quick way to compare search algorithms without having to A/B test, so we made some scripts that run a search for each of the search terms originally used to gather the labeled images, count the labeled images in the results, and calculate metrics like precision, recall and f1score.
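As a rough illustration of what gets calculated (a minimal sketch, not the repo's actual code - the function and variable names are made up):

```php
<?php
// Sketch: score one resultset against the set of images labeled relevant
// for the same search term.
function scoreResults( array $relevantIds, array $resultIds ): array {
	$retrievedRelevant = count( array_intersect( $resultIds, $relevantIds ) );
	$precision = count( $resultIds ) > 0 ? $retrievedRelevant / count( $resultIds ) : 0;
	$recall = count( $relevantIds ) > 0 ? $retrievedRelevant / count( $relevantIds ) : 0;
	// f1score is the harmonic mean of precision and recall
	$f1 = ( $precision + $recall ) > 0
		? 2 * $precision * $recall / ( $precision + $recall )
		: 0;
	return [ 'precision' => $precision, 'recall' => $recall, 'f1score' => $f1 ];
}

// 2 of the 3 results are labeled relevant (precision ≈ 0.67),
// 2 of the 4 relevant images were retrieved (recall = 0.5)
print_r( scoreResults( [ 'a', 'b', 'c', 'd' ], [ 'a', 'b', 'x' ] ) );
```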
If you want to run this locally:
- load the existing labeled data into your database using `php jobs/Install.php --populate` (note this will delete any labeled data already in your database)
- run `php jobs/AnalyzeResults.php` ... this will run default mediasearch on commons for all search terms that we have labeled data for (approx 2k), analyse the results and output them to stdout
If you want to run this using vagrant on your local machine:
- load the existing labeled data into your database using `php jobs/Install.php --populate` (note this will delete any labeled data already in your database)
- update `config.ini`, setting `[search] baseUrl` to point at your local installation
- if you want to use the replica-of-production search index in cloudelastic instead of your local search index, see the cloudelastic instructions at the end of this document
More detailed information below ...
jobs/FindLabeledImagesInResults.php
Searches commons using MediaSearch for all the search terms in `ratedSearchResult`, then finds all the labeled images in each resultset and stores them.
The point of this is to allow the user to run a search and store the labeled images found in its results, so that the results of different searches can be compared (see the sketch after the params below).
Results are stored in the following tables:
- `search`: description of the search
- `resultset`: results for a search term for a particular search
- `labeledResult`: a labeled result from a resultset, with its position and rating
Params:
- `description`: a description to be stored in the `search` table. Defaults to the date and time.
- `searchUrl`: a custom search url. Defaults to standard MediaSearch (query builder + rescore).
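To make the idea concrete, here's a minimal sketch of the filtering step (illustrative only - the function name, rating values and data shapes are invented, not the job's real code):

```php
<?php
// Keep only the results we have labels for, recording each one's position
// in the resultset. Ratings here are assumed to be e.g. 1 = good,
// 0 = indifferent, -1 = bad; the real encoding may differ.
function findLabeledResults( array $resultIds, array $labels ): array {
	$found = [];
	foreach ( array_values( $resultIds ) as $position => $imageId ) {
		if ( isset( $labels[$imageId] ) ) {
			$found[] = [
				'image' => $imageId,
				'position' => $position,
				'rating' => $labels[$imageId],
			];
		}
	}
	return $found;
}

print_r( findLabeledResults(
	[ 'File:A.jpg', 'File:B.jpg', 'File:C.jpg' ], // a resultset
	[ 'File:B.jpg' => 1, 'File:Z.jpg' => -1 ]     // imageId => rating
) );
```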
jobs/AnalyzeResults.php
Analyzes the (labeled) results from a particular search.
Calculates f1score, recall, average precision and precision@K over all results, and writes them to stdout.
If a search id is provided, the results for that search are analysed. If not, `FindLabeledImagesInResults.php` is run first, and the results from that are analysed.
Params:
- `description`: a description to be stored in the `search` table. Defaults to the date and time (only used if `searchId` is not provided).
- `searchId`: the id of the stored search that we want to analyse.
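For reference, here's one common definition of the two ranking metrics, as a minimal sketch (not the script's actual code; names are made up):

```php
<?php
// precision@K: what fraction of the top K results are labeled relevant?
function precisionAtK( array $relevantIds, array $resultIds, int $k ): float {
	$topK = array_slice( $resultIds, 0, $k );
	return count( array_intersect( $topK, $relevantIds ) ) / $k;
}

// average precision: mean of precision@K over every position K at which
// a relevant result appears (here normalised by relevant results found).
function averagePrecision( array $relevantIds, array $resultIds ): float {
	$hits = 0;
	$sum = 0.0;
	foreach ( array_values( $resultIds ) as $i => $id ) {
		if ( in_array( $id, $relevantIds, true ) ) {
			$hits++;
			$sum += $hits / ( $i + 1 ); // precision at this position
		}
	}
	return $hits > 0 ? $sum / $hits : 0.0;
}

echo precisionAtK( [ 'a', 'c' ], [ 'a', 'b', 'c', 'd' ], 2 ), "\n"; // 0.5
// relevant results at positions 1 and 3: AP = ( 1/1 + 2/3 ) / 2 ≈ 0.83
echo averagePrecision( [ 'a', 'c' ], [ 'a', 'b', 'c', 'd' ] ), "\n";
```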
A convenience script that runs the `jobs/AnalyzeResults.php` analysis on a pre-defined bunch of searches, and outputs the results.
Elasticsearch provides a thing called "learning to rank", where it applies machine learning to labeled data in order to improve search results ranking. See https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/index.html
Elasticsearch uses a text file in "ranklib" format as an input for model-building (the same file can also be used to build other models, see the section on logistic regression below).
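A ranklib file has one row per (query, document) pair: the relevance label, a query id, the feature values, and an optional trailing comment. The rows below are made-up values, just to show the shape:

```
3 qid:1 1:9.8376 2:0.0  3:4.2 # File:Example_cat.jpg  cat
0 qid:1 1:0.0    2:1.13 3:0.2 # File:Unrelated.jpg    cat
1 qid:2 1:2.01   2:0.74 3:0.0 # File:Sunset.jpg       sunset
```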
To generate a ranklib file using the labeled data in this repo, with elasticsearch queries corresponding to those used on production commons on Jan 27 2021:
- point https://127.0.0.1:9243/ at a replica of the production wikimedia-commons elasticsearch index via an ssh tunnel, using `ssh -n -L127.0.0.1:9243:cloudelastic1001.wikimedia.org:9243 mwdebug1002.eqiad.wmnet "sleep 36000"`
- run `php createRankLib.php`
The ranklib file will be output to `out/MediaSearch_20211206.tsv`.
To create a training dataset for learning-to-rank in elasticsearch we need to:
- Create a "featureset" in elasticsearch. A featureset is basically a set of additional fields with their own query params that you tack on to a normal elasticsearch query, and you'll get back the scores for each field in your response.
- Prepare elasticsearch queries for all the labeled data you have for each search term, plus the featureset stuff tacked on.
- Run each query, and munge the responses into a format that elasticsearch can use for model-building (ranklib format).
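For illustration, a featureset upload might look something like the following (a minimal sketch based on the learning-to-rank plugin docs linked above - the featureset name, feature names and index fields here are made up; see input/ for the real featuresets used in this repo):

```
POST _ltr/_featureset/my_featureset
{
  "featureset": {
    "features": [
      {
        "name": "match_title",
        "params": [ "keywords" ],
        "template_language": "mustache",
        "template": {
          "match": { "title": "{{keywords}}" }
        }
      },
      {
        "name": "match_statements",
        "params": [ "entity" ],
        "template_language": "mustache",
        "template": {
          "match": { "statement_keywords": "{{entity}}" }
        }
      }
    ]
  }
}
```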
- Create an elasticsearch featureset and send it to your elasticsearch instance via a POST request (see the docs referenced above). There are example featuresets in input/
- run `php jobs/GenerateSearchTermsWithEntitiesFile.php --outputFile="out/searchTermsWithEntities.csv"` to generate a file containing all the labeled search terms and any corresponding wikidata items they might have
- run `php jobs/GenerateFeatureQueries.php --queryJsonGenerator="MediaSearchSignalTest\Jobs\MediaSearch_20211206" --searchTermsWithEntitiesFile="out/searchTermsWithEntities.csv"`. The query json files will be output to out/ltr/
- run `php jobs/GenerateRanklibFile.php --queryDir="out/ltr/" --featuresetName=MediaSearch_20211206 --searchTermsWithEntitiesFile="out/searchTermsWithEntities.csv"`

The last script expects there to be an instance of elasticsearch at https://127.0.0.1:9243/. When running the script myself I set up an ssh tunnel to cloudelastic (a replica of the live search indices that already has the featureset set up) using `ssh -n -L127.0.0.1:9243:cloudelastic1001.wikimedia.org:9243 mwdebug1002.eqiad.wmnet "sleep 36000"` - if you do the same, the script should just work.
The file `logreg.py` trains a logistic regression model using a ranklib file, and outputs the model's coefficients and intercept. The coefficients and intercept can then be used in WikibaseMediaInfo's `MediaSearchProfiles.php`.
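To show how such a model gets used at query time, here's a minimal sketch of combining per-signal scores with trained coefficients (the signal names and numbers are made up; this is not the actual MediaSearchProfiles.php code):

```php
<?php
// A logistic regression model ranks a result by a weighted sum of its
// feature scores, squashed through the sigmoid function.
function logisticScore( array $featureScores, array $coefficients, float $intercept ): float {
	$z = $intercept;
	foreach ( $featureScores as $name => $score ) {
		$z += ( $coefficients[$name] ?? 0.0 ) * $score;
	}
	return 1 / ( 1 + exp( -$z ) ); // in (0, 1); higher = ranked higher
}

echo logisticScore(
	[ 'descriptions' => 7.2, 'statements' => 1.4 ],   // made-up feature scores
	[ 'descriptions' => 0.03, 'statements' => 0.85 ], // made-up coefficients
	-1.2                                              // made-up intercept
);
```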
ATM the file assumes the ranklib file lives in `out/MediaSearch_20211206.tsv` - you can edit the code if you need to use a different one.
You can run the file using `python3 logreg.py --trainingDataSize=N`, where N is the fraction of the rows in the ranklib file you want to use for training the model. N is set to 0.8 by default, so running with no options means the ranklib file gets shuffled randomly, then the first (total rows * 0.8) rows are used for training the model, and the rest are used for testing it.
Running with `-x` skips the shuffling step, and uses the last shuffled version.
Use this to gather search results from wikimedia commons so they can be labeled via the web app. It searches commons for each term in the input file, and inserts the results into `ratedSearchResult` with `rating` set to `null`.
Params:
- `searchTermsFile`: a comma-separated file. The first column is ignored, the second should contain the search term, and the third the language of the search term.
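For example (made-up rows; the first column can be anything, since it is ignored):

```
1,cat,en
2,sunset over water,en
3,chat noir,fr
```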
A little web app where the user is presented with a random image from the stored search results, and rates it as good, bad or indifferent
A form where you can enter a search term plus language, along with a list of good/bad/indifferent results. The data is inserted directly into the labeled data.
To set up:
- create a mysql db
- update `config.ini` to point at the right db (see the sketch after this list)
- run `composer update`
- run `php jobs/Install.php` to install the DB schema (add the `--populate` param to populate the tables with existing labeled data)
- away you go
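As a rough idea of the shape of `config.ini` (a guess, not the authoritative layout - only the `[search] baseUrl` setting is mentioned elsewhere in this README, so check the actual file in the repo for the real section and key names):

```ini
; hypothetical sketch only - real key names may differ
[db]
host = "localhost"
dbname = "mediasearchsignals"
user = "someuser"
password = "somepassword"

[search]
baseUrl = "http://localhost:8080"
```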
A job can just be run via `php jobs/<filename>`, or on toolforge it can be run using `jsub`.
If you want to use the web app for labeling images locally, you need to point a webserver at `public_html`.
There's a replica of the production search index in cloudelastic that you can use instead of your local search index when running locally.
If you're running vagrant there are 2 steps to do this:
- set up an ssh tunnel from your machine to cloudelastic by typing `ssh -n -L127.0.0.1:9243:cloudelastic1001.wikimedia.org:9243 mwdebug1002.eqiad.wmnet "sleep 36000"` in a console window (you'll need production access for this to work)
- add the following to `LocalSettings.php`:
```php
$wgCirrusSearchClusters = [
'default' => [
[
'host' => 'cloudelastic1001.wikimedia.org',
'port' => 9243,
'transport' => 'Https'
]
]
];
// Activate devel options useful for relforge
$wgCirrusSearchDevelOptions = [
'morelike_collect_titles_from_elastic' => true,
'ignore_missing_rev' => true,
];
$wgCirrusSearchIndexBaseName = 'commonswiki';
$wgCirrusSearchNamespaceMappings[ NS_FILE ] = 'file';
// Undo global config that includes commons files in other wikis search results
unset( $wgCirrusSearchExtraIndexes[ NS_FILE ] );
$wgMediaInfoMediaSearchProperties = [
'P180' => 1, // depicts
'P6243' => 1, // digital representation of (alternative weight: .1)
];
// stemming settings for live indices
$wgWBCSUseStemming = [
'ar' => [ 'index' => true, 'query' => true ],
'bg' => [ 'index' => true, 'query' => true ],
'ca' => [ 'index' => true, 'query' => true ],
'ckb' => [ 'index' => true, 'query' => true ],
'cs' => [ 'index' => true, 'query' => true ],
'da' => [ 'index' => true, 'query' => true ],
'de' => [ 'index' => true, 'query' => true ],
'el' => [ 'index' => true, 'query' => true ],
'en' => [ 'index' => true, 'query' => true ],
'en-ca' => [ 'index' => true, 'query' => true ],
'en-gb' => [ 'index' => true, 'query' => true ],
'es' => [ 'index' => true, 'query' => true ],
'eu' => [ 'index' => true, 'query' => true ],
'fa' => [ 'index' => true, 'query' => true ],
'fi' => [ 'index' => true, 'query' => true ],
'fr' => [ 'index' => true, 'query' => true ],
'ga' => [ 'index' => true, 'query' => true ],
'gl' => [ 'index' => true, 'query' => true ],
'he' => [ 'index' => true, 'query' => true ],
'hi' => [ 'index' => true, 'query' => true ],
'hu' => [ 'index' => true, 'query' => true ],
'hy' => [ 'index' => true, 'query' => true ],
'id' => [ 'index' => true, 'query' => true ],
'it' => [ 'index' => true, 'query' => true ],
'ja' => [ 'index' => true, 'query' => true ],
'ko' => [ 'index' => true, 'query' => true ],
'lt' => [ 'index' => true, 'query' => true ],
'lv' => [ 'index' => true, 'query' => true ],
'nb' => [ 'index' => true, 'query' => true ],
'nl' => [ 'index' => true, 'query' => true ],
'nn' => [ 'index' => true, 'query' => true ],
'pl' => [ 'index' => true, 'query' => true ],
'pt' => [ 'index' => true, 'query' => true ],
'pt-br' => [ 'index' => true, 'query' => true ],
'ro' => [ 'index' => true, 'query' => true ],
'ru' => [ 'index' => true, 'query' => true ],
'simple' => [ 'index' => true, 'query' => true ],
'sv' => [ 'index' => true, 'query' => true ],
'th' => [ 'index' => true, 'query' => true ],
'tr' => [ 'index' => true, 'query' => true ],
'uk' => [ 'index' => true, 'query' => true ],
'zh' => [ 'index' => true, 'query' => true ],
];
$wgMediaInfoExternalEntitySearchBaseUri = 'https://www.wikidata.org/w/api.php';
$wgMediaSearchExternalEntitySearchBaseUri = 'https://www.wikidata.org/w/api.php';
```