Arthur/Zoekeend-Phrase-Indexing

mirror of https://github.com/ArthurIdema/Zoekeend-Phrase-Indexing.git synced 2026-06-20 08:30:36 +00:00

Latest commit: d1d7eb517b by Arthur Idema, "Added descriptions to files", 2026-01-12 15:29:00 +01:00

File                                  Last commit message                           Last updated
.gitignore                            Added descriptions to files                   2026-01-12 15:29:00 +01:00
batch_search_eval.sh                  Added descriptions to files                   2026-01-12 15:29:00 +01:00
compare_phrases_vs_duckdb.py          Added descriptions to files                   2026-01-12 15:29:00 +01:00
compare_postings_cost_vs_duckdb.py    Added descriptions to files                   2026-01-12 15:29:00 +01:00
cranfield_qrels.tsv                   Updated cost in postings function             2025-08-24 22:00:02 +02:00
cranfield_queries_with_2_bigrams.tsv  Tested code on queries with multiple phrases  2025-08-31 18:09:19 +02:00
cranfield_queries.tsv                 Updated cost in postings function             2025-08-24 22:00:02 +02:00
filter_queries_by_ngrams.py           Tested code on queries with multiple phrases  2025-08-31 18:09:19 +02:00
lm.sql                                Updated cost in postings function             2025-08-24 22:00:02 +02:00
ngrams.csv                            Tested code on queries with multiple phrases  2025-08-31 18:09:19 +02:00
parse_eval_to_csv.py                  Added sign test, updated results,             2026-01-08 12:29:53 +01:00
phrase_index.py                       Removed unnecessary code                      2025-09-04 17:31:31 +02:00
phrases_extractor.py                  Removed unused imports                        2025-08-19 17:35:32 +02:00
query_splitter.py                     Added sign test, updated results,             2026-01-08 12:29:53 +01:00
README.md                             Added descriptions to files                   2026-01-12 15:29:00 +01:00
stopwords.txt                         Add stopwords list                            2025-12-02 23:50:48 +10:00
ze_eval.py                            Improved code                                 2025-08-19 17:23:02 +02:00
ze_index_export.py                    Improved code                                 2025-08-19 17:23:02 +02:00
ze_index_import.py                    Improved code                                 2025-08-19 17:23:02 +02:00
ze_index.py                           Updated the tokenizer                         2025-09-04 17:09:05 +02:00
ze_reindex_const.py                   Improved code                                 2025-08-19 17:23:02 +02:00
ze_reindex_fitted.py                  Improved code                                 2025-08-19 17:23:02 +02:00
ze_reindex_group.py                   Improved code                                 2025-08-19 17:23:02 +02:00
ze_reindex_prior.py                   Improved code                                 2025-08-19 17:23:02 +02:00
ze_search.py                          Added sign test, updated results,             2026-01-08 12:29:53 +01:00
ze_vacuum.py                          Improved code                                 2025-08-19 17:23:02 +02:00
zoekeend                              Added sign test, updated results,             2026-01-08 12:29:53 +01:00

README.md

How to use

Run python3 phrase_index.py with any of the parameters listed below:

  -h, --help            show this help message and exit
  --db DB               Database file name
  --dataset DATASET     ir_datasets name (e.g., cranfield, msmarco-passage)
  --stopwords STOPWORDS Stopwords to use (english, none)
  --mode MODE           Indexing mode (duckdb, phrases)
  --min-freq MIN_FREQ   Minimum frequency for phrases (only for mode "phrases")
  --min-pmi MIN_PMI     Minimum PMI for phrases (only for mode "phrases")
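
For orientation, the option list above can be mirrored by a small argparse parser. This is only a sketch and not the actual code in phrase_index.py: the flag names and help strings come from the help output, while the types, the `choices` restrictions, and the example values (`cranfield.db`, `5`, `8`) are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction: flag names and help texts are taken from
    # the help output; types, choices, and missing defaults are assumptions.
    parser = argparse.ArgumentParser(prog="phrase_index.py")
    parser.add_argument("--db", help="Database file name")
    parser.add_argument(
        "--dataset", help="ir_datasets name (e.g., cranfield, msmarco-passage)"
    )
    parser.add_argument(
        "--stopwords", choices=["english", "none"], help="Stopwords to use"
    )
    parser.add_argument(
        "--mode", choices=["duckdb", "phrases"], help="Indexing mode"
    )
    parser.add_argument(
        "--min-freq",
        type=int,
        help='Minimum frequency for phrases (only for mode "phrases")',
    )
    parser.add_argument(
        "--min-pmi",
        type=float,
        help='Minimum PMI for phrases (only for mode "phrases")',
    )
    return parser


# Example invocation with made-up values, parsed from a list instead of sys.argv.
args = build_parser().parse_args(
    [
        "--db", "cranfield.db",
        "--dataset", "cranfield",
        "--stopwords", "english",
        "--mode", "phrases",
        "--min-freq", "5",
        "--min-pmi", "8",
    ]
)
print(args.mode, args.min_freq, args.min_pmi)
```

Note that argparse turns `--min-freq` and `--min-pmi` into the attributes `min_freq` and `min_pmi`.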

Helper scripts

./auto_phrase.sh and ./auto_zoekeend.sh automatically index, search, and evaluate, then store the results in a results directory. auto_phrase.sh uses phrase_index.py, while auto_zoekeend.sh uses ze_index.py.
./batch_phrase.sh produces results for several variable settings in one go.
./display_results.sh displays the evaluation metrics of all previous results: MAP, CiP, dictionary size, terms size, number of phrases, AVGDL, and SUMDF.
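
The idea of running many settings in one go can be sketched in Python as a parameter grid over `--min-freq` and `--min-pmi`. This is not what the shell helper scripts do internally; the function `build_commands`, the database naming scheme, and the threshold values are all hypothetical. The sketch only builds and prints the command lines and does not run them.

```python
import itertools
import shlex


def build_commands(db_prefix, dataset, min_freqs, min_pmis):
    """Build one phrase_index.py command line per (min_freq, min_pmi) pair."""
    commands = []
    for min_freq, min_pmi in itertools.product(min_freqs, min_pmis):
        # Hypothetical naming scheme: one database file per setting.
        db = f"{db_prefix}_f{min_freq}_p{min_pmi}.db"
        commands.append(
            [
                "python3", "phrase_index.py",
                "--db", db,
                "--dataset", dataset,
                "--stopwords", "english",
                "--mode", "phrases",
                "--min-freq", str(min_freq),
                "--min-pmi", str(min_pmi),
            ]
        )
    return commands


# Made-up thresholds: 2 frequencies x 3 PMI values give 6 command lines.
commands = build_commands("cranfield", "cranfield", [2, 5], [4, 8, 12])
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line could be passed to `subprocess.run` to execute the corresponding indexing run.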

Statistical Analysis and Comparison

compare_phrases_vs_duckdb.py - Performs a two-tailed pairwise sign test comparing MAP (Mean Average Precision) results between the phrase-based and baseline approaches. Uses min_pmi=24 as the baseline. Requires scipy for statistical testing.
compare_postings_cost_vs_duckdb.py - Like the script above, but compares the Cost in Postings (CiP) metric instead of MAP, to evaluate the computational efficiency of the different indexing approaches.
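
As an illustration of the test both scripts rely on, a two-tailed paired sign test can be written with the standard library alone. This is a minimal sketch and not the code in the comparison scripts, which use scipy. It assumes the scores are paired per query, and the numbers below are made-up illustrative values, not results from this repository.

```python
from math import comb


def sign_test_two_tailed(baseline, candidate):
    """Return (wins, losses, p_value) for paired scores; ties are dropped."""
    wins = sum(1 for b, c in zip(baseline, candidate) if c > b)
    losses = sum(1 for b, c in zip(baseline, candidate) if c < b)
    n = wins + losses
    if n == 0:
        # All pairs are ties, so there is no evidence of a difference.
        return wins, losses, 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5),
    # doubled for the two-tailed test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)


# Made-up paired scores (e.g. per-query average precision), for illustration only.
baseline = [0.30, 0.25, 0.40, 0.10, 0.55, 0.20, 0.35, 0.45]
candidate = [0.35, 0.30, 0.38, 0.15, 0.60, 0.20, 0.40, 0.50]

wins, losses, p_value = sign_test_two_tailed(baseline, candidate)
print(wins, losses, p_value)  # 6 1 0.125
```

The same function applies to CiP by passing costs instead of precision scores. Since lower cost is better there, a "win" then means the candidate is more expensive, so the direction of the comparison should be read accordingly.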