Saturday, September 10, 2011

Information Retrieval and Search Engine (CSC-414)

Tribhuvan University
Institute of Science and Technology
Bachelor of Science in Computer Science and Information Technology
7th Sem: Course Title: Information Retrieval and Search Engine


Course no: CSC-414                                                               Full Marks: 60+20+20
Credit hours: 3                                                                        Pass Marks: 24+8+8

Nature of course: Theory (3 Hrs.) + Lab (3 Hrs.)

Course Synopsis:       Advanced aspects of Information Retrieval and Search Engine

Goals: To study advanced aspects of information retrieval and search engines, encompassing the principles, research results, and commercial applications of current technologies.

Course Detail:

Unit 1. Introduction:                                                                     4 Hrs.

Goals and history of IR. The impact of the web on IR. The role of artificial intelligence (AI) in IR.

Unit 2. Basic IR Models:                                                               4 Hrs.

Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.

Unit 3. Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval:         4 Hrs.

Simple tokenizing, stop-word removal, and stemming; inverted indices; efficient processing with sparse vectors; Java implementation.

Unit 4. Experimental Evaluation of IR:                                         4 Hrs.

Performance metrics: recall, precision, and F-measure; evaluations on benchmark text collections.

Unit 5. Query Operations and Languages:                                  3 Hrs.

Relevance feedback; Query expansion; Query languages.

Unit 6. Text Representation:                                                        5 Hrs.

Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages (SGML, HTML, XML).

Unit 7. Search Engine:                                                                 5 Hrs.

Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g. hubs and authorities, Google PageRank); shopping agents.


Unit 8. Text Categorization and Clustering:                                 7 Hrs.

Categorization algorithms: Rocchio, naive Bayes, decision trees, and nearest neighbor. Clustering algorithms: agglomerative clustering, k-means, and expectation maximization (EM). Applications to information filtering, organization, and relevance feedback.

Unit 9. Recommender Systems:                                                   3 Hrs.

Collaborative filtering and content-based recommendation of documents and products.

Unit 10. Information Extraction and Integration:                          3 Hrs.

Extracting data from text; XML; semantic web; collecting and integrating specialized information on the web.

Unit 11. Advanced IR Models:                                                       3 Hrs.

Probabilistic models; Generalized Vector Space Model; Latent Semantic Indexing (LSI).

Unit 12. Advanced Indexing and Searching Text:                         5 Hrs.

Efficient string searching and pattern matching.

Laboratory work:     Design and development of a search engine.

Text Books:

  1. Modern Information Retrieval, Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
  2. Information Retrieval: Data Structures & Algorithms, Bill Frakes.

Homework Assignment:   Assignments should be given throughout the semester.

Computer Usage:      No specific requirement

Prerequisite:              Server-side programming language

Category Content:    Science Aspect:           25%
                                    Design Aspect:            75%
