Mirror of https://github.com/ArthurIdema/Zoekeend-Phrase-Indexing.git (synced 2025-10-26 16:24:21 +00:00)
- helper_scripts
- ngram_queries
- queries
- .gitignore
- batch_search_eval.sh
- cranfield_qrels.tsv
- cranfield_queries_with_2_bigrams.tsv
- cranfield_queries.tsv
- filter_queries_by_ngrams.py
- lm.sql
- ngrams.csv
- output_new_postings_bigrams.csv
- output_new_postings_half1.csv
- output_new_postings_half2.csv
- output_new_postings.csv
- output.csv
- phrase_index.py
- phrases_extractor.py
- README.md
- ze_eval.py
- ze_index_export.py
- ze_index_import.py
- ze_index.py
- ze_reindex_const.py
- ze_reindex_fitted.py
- ze_reindex_group.py
- ze_reindex_prior.py
- ze_search.py
- ze_vacuum.py
- zoekeend
How to use
Run python3 phrase_index.py with any of the parameters listed below:
  -h, --help             show this help message and exit
  --db DB                Database file name
  --dataset DATASET      ir_datasets name (e.g., cranfield, msmarco-passage)
  --stopwords STOPWORDS  Stopwords to use (english, none)
  --mode MODE            Indexing mode (duckdb, phrases)
  --min-freq MIN_FREQ    Minimum frequency for phrases (only for mode "phrases")
  --min-pmi MIN_PMI      Minimum PMI for phrases (only for mode "phrases")
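To make the two phrase thresholds concrete, here is a minimal sketch of how a frequency cutoff and a PMI cutoff can be combined to select bigram phrases. It uses the standard definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))) estimated from raw counts. The function name `extract_phrases`, its default values, and the log base are assumptions made for illustration; the actual logic in phrase_index.py may differ.

```python
import math
from collections import Counter


def extract_phrases(docs, min_freq=2, min_pmi=1.0):
    """Return {(w1, w2): pmi} for bigrams passing both thresholds.

    Hypothetical helper: docs is a list of token lists. This is not the
    repository's implementation, only the textbook PMI filter.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        # Adjacent token pairs within a single document.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    phrases = {}
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:
            continue  # too rare to count as a phrase
        p_xy = freq / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            phrases[(w1, w2)] = pmi
    return phrases


docs = [
    ["boundary", "layer", "flow"],
    ["boundary", "layer", "theory"],
    ["flow", "theory"],
]
phrases = extract_phrases(docs, min_freq=2, min_pmi=1.0)
```

In this toy corpus only ("boundary", "layer") occurs at least twice, so it is the only candidate that survives the frequency cutoff; raising `min_pmi` then decides whether it is kept.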
Helper scripts
- ./auto_phrase.sh and ./auto_zoekeend.sh automatically index, search, and evaluate, then store the results in a results directory. auto_phrase uses phrase_index.py, while auto_zoekeend uses ze_index.py.
- ./batch_phrase.sh creates results for multiple different variable settings in one go.
- display_results.sh displays the evaluation metrics of all previous results (MAP, CiP, dictionary size, terms size, number of phrases, AVGDL and SUMDF).
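For readers unfamiliar with MAP, the first metric in the list above, the following is a minimal sketch of mean average precision under its usual definition. The function names and the dictionary-based input format are assumptions for illustration only; the repository's own evaluation code (ze_eval.py) is not shown here and may compute the metric differently.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list (hypothetical helper)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of average precision over all queries that have judgments.

    run:   {query_id: [doc_id, ...]} in ranked order
    qrels: {query_id: {relevant doc_id, ...}}
    """
    if not qrels:
        return 0.0
    scores = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


run = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}
map_score = mean_average_precision(run, qrels)
```

Here q1 scores (1/1 + 2/3) / 2 and q2 scores 1/2, so the mean is 2/3.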