2025-09-04 17:31:31 +02:00
helper_scripts Tested code on queries with multiple phrases 2025-08-31 18:09:19 +02:00
ngram_queries Tested code on queries with multiple phrases 2025-08-31 18:09:19 +02:00
queries Tested code on half the queries 2025-08-28 15:57:36 +02:00
.gitignore Updated the tokenizer 2025-09-04 17:09:05 +02:00
batch_search_eval.sh Tested code on half the queries 2025-08-28 15:57:36 +02:00
cranfield_qrels.tsv Updated cost in postings function 2025-08-24 22:00:02 +02:00
cranfield_queries_with_2_bigrams.tsv Tested code on queries with multiple phrases 2025-08-31 18:09:19 +02:00
cranfield_queries.tsv Updated cost in postings function 2025-08-24 22:00:02 +02:00
filter_queries_by_ngrams.py Tested code on queries with multiple phrases 2025-08-31 18:09:19 +02:00
lm.sql Updated cost in postings function 2025-08-24 22:00:02 +02:00
ngrams.csv Tested code on queries with multiple phrases 2025-08-31 18:09:19 +02:00
output_new_postings_bigrams.csv Tested code on queries with multiple phrases 2025-08-31 18:09:19 +02:00
output_new_postings_half1.csv Tested code on half the queries 2025-08-28 15:57:36 +02:00
output_new_postings_half2.csv Tested code on half the queries 2025-08-28 15:57:36 +02:00
output_new_postings.csv Moved file 2025-08-24 22:01:44 +02:00
output.csv Improved code 2025-08-19 17:23:02 +02:00
phrase_index.py Removed unnecessary code 2025-09-04 17:31:31 +02:00
phrases_extractor.py Removed unused imports 2025-08-19 17:35:32 +02:00
README.md Improved code 2025-08-19 17:23:02 +02:00
ze_eval.py Improved code 2025-08-19 17:23:02 +02:00
ze_index_export.py Improved code 2025-08-19 17:23:02 +02:00
ze_index_import.py Improved code 2025-08-19 17:23:02 +02:00
ze_index.py Updated the tokenizer 2025-09-04 17:09:05 +02:00
ze_reindex_const.py Improved code 2025-08-19 17:23:02 +02:00
ze_reindex_fitted.py Improved code 2025-08-19 17:23:02 +02:00
ze_reindex_group.py Improved code 2025-08-19 17:23:02 +02:00
ze_reindex_prior.py Improved code 2025-08-19 17:23:02 +02:00
ze_search.py Improved code 2025-08-19 17:23:02 +02:00
ze_vacuum.py Improved code 2025-08-19 17:23:02 +02:00
zoekeend Improved code 2025-08-19 17:23:02 +02:00

How to use

Run python3 phrase_index.py with any of the parameters listed below:

  -h, --help            show this help message and exit
  --db DB               Database file name
  --dataset DATASET     ir_datasets name (e.g., cranfield, msmarco-passage)
  --stopwords STOPWORDS Stopwords to use (english, none)
  --mode MODE           Indexing mode (duckdb, phrases)
  --min-freq MIN_FREQ   Minimum frequency for phrases (only for mode "phrases")
  --min-pmi MIN_PMI     Minimum PMI for phrases (only for mode "phrases")
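
To illustrate how the --min-freq and --min-pmi thresholds could interact, here is a minimal sketch of bigram phrase selection by pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The function name extract_phrases, the use of plain relative frequencies as probabilities, and the toy documents are assumptions made for illustration; phrase_index.py may compute its statistics differently.

```python
import math
from collections import Counter


def extract_phrases(docs, min_freq=2, min_pmi=1.0):
    """Return {bigram: pmi} for bigrams that pass both thresholds.

    docs is a list of token lists. Probabilities are plain relative
    frequencies, which is an illustrative assumption and not necessarily
    what phrase_index.py does.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    phrases = {}
    for (w1, w2), count in bigrams.items():
        # Frequency filter (the role --min-freq would play).
        if count < min_freq:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        # Association filter (the role --min-pmi would play).
        if pmi >= min_pmi:
            phrases[(w1, w2)] = pmi
    return phrases


# Toy corpus, made up for this example.
docs = [
    ["boundary", "layer", "flow"],
    ["boundary", "layer", "theory"],
    ["the", "flow", "of", "the", "layer"],
]
phrases = extract_phrases(docs, min_freq=2, min_pmi=1.0)
print(phrases)
```

In this toy corpus only the bigram ("boundary", "layer") occurs at least twice, so it is the only candidate that reaches the PMI check. Raising either threshold shrinks the set of phrases that would be indexed.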

Helper scripts

  • ./auto_phrase.sh and ./auto_zoekeend.sh automatically index, search and evaluate, and store the results in a results directory. auto_phrase.sh uses phrase_index.py, while auto_zoekeend.sh uses ze_index.py.

  • ./batch_phrase.sh produces the results for multiple parameter settings in a single run.

  • display_results.sh displays the evaluation metrics of all previous results: MAP, CiP, dictionary size, terms size, number of phrases, AVGDL and SUMDF.
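
As a rough sketch of what a batch run over several parameter settings might look like, the following Python snippet builds one phrase_index.py command line per combination of --min-freq and --min-pmi. It only constructs the commands and does not execute them. The function name build_commands, the database file naming pattern, and the threshold values are hypothetical; the actual batch_phrase.sh may work differently.

```python
import itertools
import shlex


def build_commands(dataset, min_freqs, min_pmis):
    """Build one phrase_index.py command line per (min_freq, min_pmi) pair."""
    commands = []
    for min_freq, min_pmi in itertools.product(min_freqs, min_pmis):
        # The database file name pattern is a made-up convention.
        db = f"{dataset}_f{min_freq}_p{min_pmi}.db"
        args = [
            "python3", "phrase_index.py",
            "--db", db,
            "--dataset", dataset,
            "--stopwords", "english",
            "--mode", "phrases",
            "--min-freq", str(min_freq),
            "--min-pmi", str(min_pmi),
        ]
        commands.append(shlex.join(args))
    return commands


# Threshold values are illustrative only.
commands = build_commands("cranfield", min_freqs=[5, 10], min_pmis=[3, 5])
for command in commands:
    print(command)
```

Each printed line could then be run from a shell, or the argument list could be passed to subprocess.run instead of being joined into a string.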