Showing items 1 - 4 of 4
  • Publication
    Open access
    Multilingual and domain-specific IR: a case study in cultural heritage
    (2015)
    Akasereh, Mitra
    Nowadays data collections exist in many different languages and in many different fields, so there is a rising need for search systems that handle multilinguality, as well as for professional search systems that let their users search within a specific field of knowledge.
    In this thesis we propose a search system for data on cultural heritage. Our data comes from different sources located in different countries and written in various languages. We study the specific structure, characteristics and terminology of data in this field in order to build an effective retrieval system. We evaluate different information retrieval models and indexing strategies on monolingual data to find those that are most effective and best suited to the nature of our data. To deal with the different languages we study each language separately and propose tools such as per-language stemmers and fusion operators to merge the results obtained from different languages. To cross the language barrier we study different translation methods, and to further enhance the search results we investigate different query expansion techniques.
    Based on our results we propose using models from the DFR family for English and the Okapi model for French and Polish, together with a light stemmer. To cross the language barrier we propose combining several translation methods. The Z-score operator is the best-performing merging operator in our multilingual tests when combining result lists from different languages. Finally, we propose applying query expansion based on an external source to improve search performance.
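    As an illustration of the Z-score merging idea mentioned above, here is a minimal Python sketch, assuming each per-language run is available as a dict mapping document IDs to retrieval scores; the function name and data layout are illustrative and not taken from the thesis:

        import statistics

        def zscore_fusion(runs):
            """Merge several result lists (e.g. one per language) by converting
            each run's scores to z-scores and summing them per document.
            `runs` is a list of dicts {doc_id: raw_score}."""
            merged = {}
            for run in runs:
                if not run:
                    continue
                scores = list(run.values())
                mean = statistics.mean(scores)
                stdev = statistics.pstdev(scores) or 1.0  # guard against zero variance
                for doc_id, score in run.items():
                    merged[doc_id] = merged.get(doc_id, 0.0) + (score - mean) / stdev
            # A higher fused score means a better rank in the merged list
            return sorted(merged.items(), key=lambda item: item[1], reverse=True)

    Normalising each run before summing is what allows scores produced by different IR models, or on collections in different languages, to be compared on a common scale.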
  • Publication
    Open access
    Information retrieval with Hindi, Bengali, and Marathi languages: evaluation and analysis
    (2013)
    Akasereh, Mitra ; Dolamic, Ljiljana
    Our first objective in participating in the FIRE evaluation campaigns is to analyze the retrieval effectiveness of various indexing and search strategies when dealing with corpora written in the Hindi, Bengali and Marathi languages. As a second goal, we have developed new and more aggressive stemming strategies for both Marathi and Hindi during this second campaign, and we have compared their retrieval effectiveness with a light stemming strategy and with an n-gram language-independent approach. As another language-independent indexing strategy, we have evaluated the trunc-n method, in which the indexing term is formed by keeping only the first n letters of each word. To evaluate these solutions we have used various IR models, including models derived from Divergence from Randomness (DFR), a Language Model (LM), Okapi, and the classical tf idf vector-processing approach.
    For the three studied languages, our experiments tend to show that IR models derived from the Divergence from Randomness (DFR) paradigm produce the best overall results. Our experiments also demonstrate that either an aggressive stemming procedure or the trunc-n indexing approach yields better retrieval effectiveness than other word-based or n-gram language-independent approaches. Applying the Z-score data fusion operator after blind-query expansion also tends to improve the MAP of the merged run over the best single IR system.
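    The trunc-n indexing described above reduces each word to its first n characters. A minimal sketch, assuming pre-tokenised text (the function name is illustrative):

        def trunc_n(tokens, n=4):
            """Language-independent trunc-n indexing: the indexing term is
            simply the first n characters of each token."""
            return [token[:n] for token in tokens]

        # Example with n = 4 (trunc-4) on an already tokenised phrase
        print(trunc_n(["retrieval", "effectiveness", "with", "marathi"]))
        # -> ['retr', 'effe', 'with', 'mara']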
  • Publication
    Open access
    Ad hoc retrieval with Marathi language
    (2013)
    Akasereh, Mitra
    Our goal in participating in the FIRE 2011 evaluation campaign is to analyse and evaluate the retrieval effectiveness of our retrieval system when using the Marathi language. We have developed a light and an aggressive stemmer for this language as well as a stopword list. In our experiments seven different IR models (language model, DFR-PL2, DFR-PB2, DFR-GL2, DFR-I(ne)C2, tf idf and Okapi) were used to evaluate the influence of these stemmers, as well as of the n-gram and trunc-n language-independent indexing strategies, on retrieval performance. We also applied a pseudo-relevance feedback (blind-query expansion) approach to estimate its impact on retrieval effectiveness. Our results show that for the Marathi language the DFR-I(ne)C2, DFR-PL2 and Okapi IR models yield the best performance. For this language the trunc-n indexing strategy gives the best retrieval effectiveness compared to the other stemming and indexing approaches. The adopted pseudo-relevance feedback approach also tends to enhance retrieval effectiveness.
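    Blind-query expansion (pseudo-relevance feedback) assumes the top-ranked documents of an initial search are relevant and enriches the query with their most frequent terms. A minimal sketch under that assumption; the parameter m and the data layout are illustrative, not the settings used in the paper:

        from collections import Counter

        def blind_query_expansion(query_terms, top_doc_tokens, m=10):
            """Add to the query the m most frequent terms found in the
            top-ranked documents of the initial retrieval.
            `top_doc_tokens` is a list of token lists, one per top document."""
            counts = Counter()
            for tokens in top_doc_tokens:
                counts.update(tokens)
            expansion = [term for term, _ in counts.most_common()
                         if term not in query_terms]
            return list(query_terms) + expansion[:m]

    The expanded query is then run a second time against the index; removing stopwords before counting keeps function words out of the expansion terms.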
  • Publication
    Open access
    Retrieval effectiveness study with Farsi language
    (2012)
    Akasereh, Mitra
    Having Farsi as the underlying language and using a test collection of 166,774 documents and 100 topics, this experiment evaluates the retrieval effectiveness of different IR models when using a light and a plural stemmer as well as n-gram and trunc-n indexing strategies. Moreover, the impact of stoplist removal is evaluated. According to the obtained results the DFR-I(ne)C2 model is the best performing one. The proposed light and plural stemmers improve retrieval performance compared to the non-stemming approach. The trunc-4 and trunc-5 indexing strategies also have a positive impact on performance, while 3-grams and trunc-3 have the most negative impact on the results. The results reveal that for Farsi, stoplist removal plays an important role in improving retrieval performance. A query-by-query analysis shows that extreme results could be avoided by adding extra controls and rules, based on Farsi morphology, to the stemming algorithms.
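    To make the stoplist and plural-stemming steps concrete, here is a minimal sketch assuming tokenised Farsi text; it strips only the common Persian plural suffixes 'ها' and 'ان', which is a deliberate over-simplification of the rule sets evaluated in the paper:

        def stoplist_and_plural_stem(tokens, stopwords):
            """Drop stopwords, then strip frequent Persian plural suffixes
            from sufficiently long tokens (illustrative rules only; the
            paper's stemmers use fuller, morphology-aware rule sets)."""
            plural_suffixes = ("ها", "ان")
            stemmed = []
            for token in tokens:
                if token in stopwords:
                    continue
                for suffix in plural_suffixes:
                    if token.endswith(suffix) and len(token) > len(suffix) + 2:
                        token = token[: -len(suffix)]
                        break
                stemmed.append(token)
            return stemmed

    The length guard is one example of the kind of extra control the query-by-query analysis points to: without it, short words that merely end in the same letters would be truncated and produce erroneous matches.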