Understanding Indexing: Key Concepts and Methods
Indexing plays a crucial role in organizing and retrieving information efficiently. It simplifies data, enhances accuracy, and enables quick access. This comprehensive guide explores the concept of indexing, different methods like pre-coordinate and post-coordinate indexing, factors affecting indexing performance, and the importance of choosing the right indexing approach.
MLIS 102 Advanced Information Retrieval System (IRS) (Theory)
Indexing: Specific Aspects
(i) Indexing: Concept, Definitions, Functions
(ii) Citation Indexing: Concept, Definitions (SCI & SSCI)
(iii) Thesaurus Construction
Indexing: Specific Aspects
Indexing: Meaning and Concept
Indexing is a useful and widely applied concept in various fields and disciplines. It can help to simplify complex data, improve efficiency and accuracy, and facilitate comparison and analysis.
Indexing is a technique that helps to efficiently retrieve data from a database or a file. It involves creating a data structure, called an index, that stores some information about the data and allows faster access to it. Indexing is similar to having a table of contents in a book, where you can quickly find the page number of a topic without scanning the whole book.
Some of the factors that affect the performance of indexing are:
Access type: how the data is searched, such as by value, range, or pattern.
Access time: how long it takes to find the data using the index.
Insertion time: how long it takes to add new data and update the index.
Deletion time: how long it takes to remove data and update the index.
Space overhead: how much extra space is required by the index.
Indexing can improve the speed and efficiency of data retrieval, but it also requires more space and maintenance. Therefore, it is important to choose the appropriate indexing method and attributes for each data set and query.
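The trade-off described above can be sketched in a few lines of Python. This is a minimal illustration, not a database implementation: the records, the function names, and the choice of a word-to-record lookup table are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records (record id, text); invented for illustration.
RECORDS = [
    (1, "chain indexing by ranganathan"),
    (2, "citation indexing and the science citation index"),
    (3, "thesaurus construction for indexing languages"),
]


def linear_scan(records, word):
    """Without an index: read every record and test it."""
    return [rid for rid, text in records if word in text.split()]


def build_index(records):
    """Build the index: word -> sorted list of record ids.

    This is the extra structure that costs space and must be
    updated on every insertion or deletion.
    """
    index = defaultdict(set)
    for rid, text in records:
        for word in text.split():
            index[word].add(rid)
    return {word: sorted(ids) for word, ids in index.items()}


def indexed_lookup(index, word):
    """With an index: a single dictionary lookup, no scanning."""
    return index.get(word, [])


INDEX = build_index(RECORDS)
print(indexed_lookup(INDEX, "citation"))
```

Both functions return the same answers; the indexed version avoids reading every record at query time, at the price of storing and maintaining `INDEX`.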
Assigned indexing is also known as concept indexing, because it involves identifying the concept(s) associated with the content of each document. It is a method of indexing in which a human indexer selects one or more subject headings or descriptors from a controlled vocabulary to represent the subject(s) of a work. The indexing terms selected to represent the content need not appear in the title or text of the document indexed. Here, an indexing language is designed and used for both indexing and searching. Some notable examples of assigned indexing are chain indexing, PRECIS, POPSI, and classification schemes.
Methods of Indexing
Two main types of methods are used in indexing:
1. Pre-coordinate indexing
2. Post-coordinate indexing

Pre-coordinate indexing
Alphabetical subject indexing
Classified subject indexing
Chain indexing (Dr. S. R. Ranganathan)
PRECIS (Derek Austin)
POPSI (G. Bhattacharya, 1964)

Post-coordinate indexing
Uniterm indexing system (M. Taube)
KWIC indexing (H. P. Luhn)
KWOC
Citation indexing (Eugene Garfield)
Peek-a-boo (Batten)
Zatocoding system (Mooers)
PRE-COORDINATE INDEXING SYSTEMS
Pre-coordinate indexing systems are methods of organizing and retrieving information based on the combination of terms or concepts that represent the subject of a document. In pre-coordinate indexing, the indexer selects and arranges the terms or concepts before any user request is made, and creates an index entry that reflects the logical relationship among them. The index entry is usually a string or a chain of terms or symbols that can be searched as a whole. Pre-coordinate indexing systems are commonly used in printed indexes, such as library catalogs, bibliographies, and abstracting and indexing journals. (Terms/Strings/Role operators)
Some of the advantages of pre-coordinate indexing systems are:
They provide context and clarity for the subject of the document, as the terms or concepts are linked by rules of syntax and punctuation.
They allow for browsing and the discovery of related documents, as the index entries are arranged in alphabetical or classified order.
They can express complex and compound subjects more effectively than single words or post-coordinated terms.
They can be faceted or deconstructed by systems to display different aspects or categories of the subject, such as place, time, form, etc.
Some of the disadvantages of pre-coordinate indexing systems are:
They require a lot of intellectual effort and skill from the indexer to select and coordinate the terms or concepts appropriately.
They may not match the exact query or need of the user, as the user has to formulate the search using the same terms or concepts as the indexer.
They may result in inconsistent or incomplete coverage of the subject, as different indexers may use different terms or concepts or different levels of specificity.
They may become outdated or obsolete as new terms or concepts emerge or change over time.
Some examples of pre-coordinate indexing systems are:
Ranganathan's Chain Indexing, which uses a scheme of classification and a set of rules to construct index entries from class numbers and facet indicators.
G. Bhattacharya's POPSI (Postulate-Based Permuted Subject Indexing), which uses a set of postulates and principles to generate index entries from keywords and their modifiers.
Derek Austin's PRECIS (Preserved Context Index System), which uses a set of roles and relations to create index entries from keywords and their attributes.
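The difference between coordinating terms at indexing time and at search time can be sketched as follows. This is a simplified illustration under stated assumptions: the headings, document ids, and function names are hypothetical, and the post-coordinate side is modelled loosely on a Uniterm-style posting list rather than on any particular system.

```python
# Pre-coordinate: the indexer fixes the combination of terms in advance.
# Hypothetical headings and document ids.
PRE_COORDINATED = {
    "libraries. cataloguing. automation": ["doc1"],
    "libraries. staff. training": ["doc2"],
}

# Post-coordinate (Uniterm-style): one posting set per single term.
POSTINGS = {
    "libraries": {"doc1", "doc2"},
    "cataloguing": {"doc1"},
    "automation": {"doc1"},
    "staff": {"doc2"},
    "training": {"doc2"},
}


def pre_coordinate_search(heading):
    """The searcher must supply the heading exactly as the indexer built it."""
    return PRE_COORDINATED.get(heading, [])


def post_coordinate_search(*terms):
    """The searcher combines terms at search time by intersecting posting sets."""
    sets = [POSTINGS.get(term, set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


print(pre_coordinate_search("libraries. staff. training"))
print(post_coordinate_search("training", "libraries"))
```

In this sketch the pre-coordinated heading keeps the terms in context, but a searcher who enters them in another order finds nothing; the post-coordinate search accepts any combination but carries no information about how the terms relate.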
PRECIS (Preserved Context Index System)
PRECIS is a system of subject indexing in which the initial string of terms is organised according to a scheme of role operators. It was developed by Derek Austin in 1971 for subject indexing in the British National Bibliography (BNB). Before PRECIS, BNB used chain indexing. When BNB adopted the MARC system and began using computers, chain indexing faced many difficulties, so it was replaced by PRECIS. Later, in 1990, PRECIS was itself replaced by COMPASS.
Concept of PRECIS
There are three main concepts:
Term: A term is a verbal representation of a concept. It may consist of one or more words.
String: An ordered sequence of component terms, excluding articles, prepositions, etc., each preceded by a role operator, is called a string. The string represents the subject of the document.
Role Operators: The role operators are code symbols which show the function of each component term and fix its position in the string. These role operators are meant for the guidance of indexers only and do not appear in the index entry.
There are two types of role operators:
1. Primary Operators (Mainline Operators)
2. Secondary Operators (Interposed Operators)

Primary Operators
Environment
  0  Location
Core Concepts
  1  Key system; object of transitive action
  2  Action; effect of action
  3  Performer of transitive action (agent of action)
Extra-Core Concepts
  4  Viewpoint
  5  Selected instance: study region, study example, sample population
  6  Form of document; target user
Secondary Operators
There are three groups: (f) and (g) for coordinate concepts; (p), (q), and (r) for dependent elements; (s), (t), and (u) for special classes of action.
Coordinate concepts
  (f)  Bound coordinate concept
  (g)  Standard coordinate concept
Dependent elements
  (p)  Part; property
  (q)  Member of quasi-generic group
  (r)  Assembly
Special classes of action
  (s)  Role definer
  (t)  Author-attributed association
  (u)  Two-way interaction
Codes:
Theme Interlinks
  $x  1st concept in coordinate theme
  $y  2nd or subsequent concept in coordinate theme
  $z  Common concept
Term Codes
  $a  Common noun
  $c  Proper name
  $d  Place name
Entry Structure of PRECIS
Each entry has a two-line, three-part structure.
Lead: The lead position holds the user's approach term, by which the user may search the index.
Qualifier: The qualifier position is occupied by the terms that set the lead into its wider context.
Display: The display is the remaining part of the string.
For a string A B C D E with C as the lead:
Lead (1st part): C    Qualifier (2nd part): B A
Display (3rd part): D E
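The two-line, three-part layout can be sketched as a small routine that moves each term of a string into the lead position in turn. This is only an illustration of the lead/qualifier/display arrangement described above: it ignores role operators, codes, and any special entry formats, and the function names and the punctuation used between terms are assumptions made for the example.

```python
def precis_entries(string, leads=None):
    """Return (lead, qualifier, display) for each lead term in the string.

    `string` is a list of terms ordered from wider to narrower context.
    `leads` optionally restricts which terms may appear as leads.
    """
    entries = []
    for i, term in enumerate(string):
        if leads is not None and term not in leads:
            continue
        qualifier = list(reversed(string[:i]))  # wider context, nearest term first
        display = list(string[i + 1:])          # the remaining part of the string
        entries.append((term, qualifier, display))
    return entries


def format_entry(lead, qualifier, display):
    """Lay out one entry on two lines: lead and qualifier, then display."""
    first_line = lead
    if qualifier:
        first_line += ". " + ". ".join(qualifier)
    if display:
        return first_line + "\n  " + ". ".join(display)
    return first_line


for entry in precis_entries(["A", "B", "C", "D", "E"]):
    print(format_entry(*entry))
```

With C as the lead, the routine places B and A in the qualifier and D and E in the display, so every entry still carries the full string.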
Objectives of PRECIS
1. The computer, not the indexer, should produce all index entries. The indexer's responsibility is to prepare the input strings and to give the necessary instructions to the computer to generate index entries according to definite formats.
2. Each of the sought terms should find index entries, and each entry should express the complete thought content (full context) of the document. This is unlike the chain procedure, where only one entry is specific, i.e. fully co-extensive with the subject of the document, and the others are cross references describing only one aspect of the thought content of the document.
3. Each entry should be expressive.
4. The system should be based on a single set of logical rules to make it consistent.
5. The system should be based on the concept of an open-ended vocabulary, which means that terms can be admitted into the index at any time, as soon as they have been encountered in the literature.
6. The system must have sufficient references between semantically related terms.
Features of PRECIS
It is more amenable to automatic manipulation than indexing based on notational classifications.
The permuted entries read naturally, which is achieved by the prescribed order of the role operators.
The terms are linked to a machine-held thesaurus, thereby providing possible 'see' and 'see also' references.
PRECIS can be adapted to other languages.
The indexer determines the meaning of the terms, codes the roles, and identifies the lead terms, whereas the computer takes care of the permutations.
Its subject formulation is completely independent of classification and is therefore not tied exclusively to any classification numbers assigned in the MARC record.
Context is preserved: it presents the full subject statement at every point of index entry, by gradual inversion of the concept string, thus overcoming the problem of the disappearing chain.
Citation Indexing: Concept, Definitions (SCI & SSCI)
A method of tracking the impact and influence of scholarly publications.
Citation indexing is a kind of bibliographic index that records the citations between publications. It allows the user to easily find out which later documents cite which earlier documents. A citation index can also reveal the connections and relationships among different fields and disciplines of research.
The concept of citation indexing is not new. It can be traced back to the 12th century, when Maimonides created an index of biblical citations in rabbinic literature. Later, citation indexes were developed for legal cases, such as Shepard's Citations in 1873. The first citation index for scientific papers was introduced by Eugene Garfield's Institute for Scientific Information (ISI) in 1961. It was called the Science Citation Index (SCI), and it covered journals from various disciplines of science. Later, ISI expanded its coverage to include the Social Sciences Citation Index (SSCI) and the Arts and Humanities Citation Index (AHCI).
Citation indexes have several advantages over traditional indexing and abstracting services. They are multidisciplinary, covering a wide range of subjects and sources. They are also independent of language, title words, or author keywords, since they rely on citation connections to retrieve relevant papers. They enable various citation-based search strategies, such as bibliographic coupling, co-citation, and KeyWords Plus.
Citation indexes are also useful for measuring the impact and quality of research. They provide indicators such as citation counts, h-index, impact factor, and Eigenfactor, which reflect how often a paper, an author, or a journal is cited by others. (The Eigenfactor score, developed by Jevin West and Carl Bergstrom at the University of Washington, is a rating of the total importance of a scientific journal.) These indicators can help researchers identify influential papers, authors, and journals in their fields.
Today, there are many sources of citation data available online, such as Google Scholar, Microsoft Academic, Scopus, and Web of Science. Each source has its own strengths and limitations in terms of coverage, accuracy, and functionality. Users should be aware of these differences and choose the most appropriate source for their needs.
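The ideas in this section (which later papers cite an earlier one, co-citation, bibliographic coupling, and the h-index) can be sketched on a toy data set. The paper ids and reference lists below are hypothetical, and the functions are a minimal illustration, not a description of how SCI or any commercial index is built.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: each paper maps to the earlier papers it cites.
REFERENCES = {
    "P1": [],
    "P2": ["P1"],
    "P3": ["P1", "P2"],
    "P4": ["P1", "P2"],
    "P5": ["P3"],
}


def build_citation_index(references):
    """Invert the reference lists: cited paper -> sorted list of citing papers."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by[cited].add(citing)
    return {paper: sorted(citers) for paper, citers in cited_by.items()}


def co_citation_counts(references):
    """Count how often each pair of papers is cited together by a later paper."""
    counts = defaultdict(int)
    for cited_list in references.values():
        for pair in combinations(sorted(set(cited_list)), 2):
            counts[pair] += 1
    return dict(counts)


def coupling_strength(references, a, b):
    """Bibliographic coupling: the number of references two papers share."""
    return len(set(references[a]) & set(references[b]))


def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    ranked = sorted(citation_counts, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
    return h


CITED_BY = build_citation_index(REFERENCES)
print(CITED_BY["P1"])
```

Note that none of these functions looks at titles, keywords, or language; retrieval depends only on the citation links, which is the property the section highlights.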
Thesaurus Construction
A thesaurus is a type of controlled vocabulary that lists synonyms and related terms for a given concept. It is used to improve the consistency and precision of indexing and searching documents in a specific domain or discipline. Thesaurus construction is the process of creating and maintaining a thesaurus. It involves the following steps:
1. Collecting terms: This step involves identifying and selecting the relevant terms that represent the concepts in the domain of interest. The terms can be extracted from various sources, such as existing documents, databases, dictionaries, glossaries, subject headings, etc.
2. Modifying terms: This step involves modifying the terms to conform to certain standards and rules, such as spelling, grammar, format, etc. It also involves eliminating duplicate, ambiguous, or obsolete terms, and resolving conflicts or inconsistencies among different sources.
3. Assigning descriptors and non-descriptors: This step involves deciding which terms will be used as preferred terms (also called descriptors) and which terms will be used as non-preferred terms (also called non-descriptors or synonyms). A descriptor is a term that is chosen to represent a concept in the thesaurus, while a non-descriptor is a term that is not chosen but is related to a descriptor. Non-descriptors are linked to descriptors by equivalence relationships, such as USE and USED FOR.
4. Establishing semantic relationships: This step involves establishing the hierarchical, associative, and equivalence relationships among the descriptors in the thesaurus. A hierarchical relationship indicates that one descriptor is a broader or narrower term of another descriptor. An associative relationship indicates that two descriptors are related but not hierarchically. An equivalence relationship indicates that two descriptors are synonyms or variants of each other.
5. Defining scope notes: This step involves providing definitions or explanations for the descriptors in the thesaurus. A scope note is a brief statement that clarifies the meaning, scope, or usage of a descriptor. It helps to avoid confusion or ambiguity among similar or overlapping concepts.
6. Revising and updating: This step involves reviewing and evaluating the thesaurus for accuracy, completeness, consistency, and currency. It also involves updating the thesaurus to reflect new developments or changes in the domain.
Thesaurus construction is an important and challenging task that requires domain knowledge, linguistic skills, and analytical abilities. It can be done manually or with the help of software tools. A well-constructed thesaurus can enhance the quality and efficiency of information retrieval and management.
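The result of the steps above can be sketched as a small data structure. Everything here is an assumption made for illustration: the terms are hypothetical, the field labels follow common thesaurus shorthand (SN for scope note, UF for USED FOR, BT, NT, and RT for broader, narrower, and related terms), and the consistency check stands in for only one small part of the revising step.

```python
# Hypothetical mini-thesaurus keyed by descriptor (preferred term).
THESAURUS = {
    "Indexing": {
        "SN": "Representing the subjects of documents for retrieval.",
        "UF": ["Subject indexing"],
        "BT": [],
        "NT": ["Citation indexing"],
        "RT": ["Thesauri"],
    },
    "Citation indexing": {
        "SN": "Indexing based on citations between publications.",
        "UF": ["Citation indexes"],
        "BT": ["Indexing"],
        "NT": [],
        "RT": [],
    },
    "Thesauri": {
        "SN": "Controlled vocabularies showing term relationships.",
        "UF": ["Thesaurus"],
        "BT": [],
        "NT": [],
        "RT": ["Indexing"],
    },
}


def build_use_map(thesaurus):
    """Map every descriptor and non-descriptor (lowercased) to its descriptor."""
    use = {}
    for descriptor, entry in thesaurus.items():
        use[descriptor.lower()] = descriptor
        for non_descriptor in entry.get("UF", []):
            use[non_descriptor.lower()] = descriptor  # non-descriptor USE descriptor
    return use


def preferred_term(use_map, term):
    """Translate a searcher's or indexer's term into the preferred term, if any."""
    return use_map.get(term.lower())


def missing_reciprocals(thesaurus):
    """List BT/NT links between defined descriptors that lack the inverse link."""
    problems = []
    for descriptor, entry in thesaurus.items():
        for narrower in entry.get("NT", []):
            if narrower in thesaurus and descriptor not in thesaurus[narrower].get("BT", []):
                problems.append((descriptor, "NT", narrower))
        for broader in entry.get("BT", []):
            if broader in thesaurus and descriptor not in thesaurus[broader].get("NT", []):
                problems.append((descriptor, "BT", broader))
    return problems


USE = build_use_map(THESAURUS)
print(preferred_term(USE, "subject indexing"))
```

The USE map is what lets an indexer and a searcher arrive at the same descriptor from different starting words; the reciprocity check is one example of the consistency review mentioned in the last step.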
The thesaurus is a relatively new tool for controlling the indexing language. The word was first used in this context in 1957, and since then its application in information retrieval methods and systems has been increasing steadily. Today many thesauri (plural of thesaurus) are being compiled in different subjects for different purposes. In Hindi, a thesaurus may be described as a dictionary of specific terms. Thesauri in printed form are readily available for use today.
1. Meaning of Thesaurus: The dictionary meaning of the word thesaurus is a collection of words put together in groups according to likeness in their meaning rather than in an alphabetical list. However, in library and information science parlance, the word thesaurus means an authoritative list showing terms which may, and sometimes may not, be used in a catalogue or index to describe concepts. Technically, a thesaurus could be defined as a compilation of words and phrases showing synonyms, hierarchical and other relationships and dependencies, the function of which is to provide a standard vocabulary for information storage and retrieval systems.
A thesaurus may be defined either in terms of its functions or its structure. In terms of functions, a thesaurus is a terminological control device used in translating from the natural language of documents, indexers or users into a more constrained system language. In terms of structure, a thesaurus is a controlled and dynamic vocabulary of semantically and generically related terms which covers a specific domain of knowledge.
There are mainly two types of thesaurus:
(a) Regular Thesaurus: It comprises definitions composed of descriptions of terms. A definition may bear any of several relationships to the entry. A relationship of near equality is the one most frequently used, expressed as 'see also'. Terms are well connected by 'see' and 'see also' cross references, so as to avoid duplication.
(b) Stem Thesaurus: It consists simply of a list of word stems, derived from the words included in a particular document collection, each distinct word stem being furnished with a different sequence number.
2. Advantages of Thesaurus: The use of a thesaurus gives the following advantages in information retrieval systems:
1. Information retrieval procedures can be extended to collections in many different areas, since the thesaurus problem no longer constitutes an impediment.
2. The investigation of differences in vocabulary between different subject areas becomes possible.
3. It removes any possible differences in retrieval effectiveness between different subject areas due to disturbances introduced by varying methods of thesaurus construction.
4. The investigation of the retrieval effectiveness of a variety of thesauri for a given collection becomes possible, including variation in the thesaurus size, in the number of concept classes, and in the correspondents assigned to each class.
3. Main Features of a Thesaurus: The main part of a thesaurus is a list of terms that defines the index language. These terms are normally arranged in an alphabetical sequence. A thesaurus features descriptors, i.e. terms which are acceptable for use in indexes to describe subject concepts, as well as non-descriptors, i.e. terms which are not to be used in the index but appear in order to expand the entry vocabulary of the indexing language. Most of the terms in a thesaurus are single-concept terms, even though in some instances a deliberate decision may be made to list multi-concept terms. Thus single words, phrases of two or three words, words linked by 'and', compound phrases and names of persons are featured in the thesaurus. Relationships between terms in an indexing language are also indicated in most thesauri. These relationships may be preferential, hierarchical or affinitive.
4. Role of Thesaurus in IRS: Advances in technology and the information explosion have ushered in a new concept in information retrieval, known as the on-line system. In an on-line system of information retrieval, the user is in direct communication with the computer through a terminal. The use of a computer alone does not guarantee effective access to an information retrieval system. In fact, the entire structure of an IRS depends on the intellectual organization of document contents and the logical processing of the request, i.e. searching. Here a thesaurus can play an important role in the following ways:
(i) As a language normalization tool: In an on-line IR system, natural language searching capabilities enhance the performance of the system. When natural language is the primary input to an information system, any content analysis must include means for consistent language normalization. A well-defined terminology adds to the clear perception of any given topic. Any subject is identified by various symbols attached to it. These symbols take concrete form in the thesaurus. The thesaurus is an organized representation of this symbolism, and it is also the most effective tool for providing language normalization. It is an essential feature of any IR system, and especially of an on-line system.
(ii) As an intellectual aid: In an on-line system, the searcher carries a very heavy intellectual burden because the primary input is in natural language. Hence, at the search level, all possible entry points have to be thought of and searched to retrieve the relevant information. A thesaurus is the most effective tool to lighten this intellectual burden and to help the searcher in standardizing the language. In fact, a thesaurus proves to be much more beneficial to the searcher in a natural language system than in a controlled vocabulary indexing system. Here it functions as a search tool and as one of the most essential aids of the searching profession.
(iii) As a tool for vocabulary control: Vocabulary control is necessary in respect of terms used as subject identifiers in a catalogue or index, because of the variety of natural language. Such control may involve barring certain terms from use as headings or access points in a library catalogue or index. The terms which are to be used are specified, and synonyms are recognized and, as far as possible, eliminated. Preferred word forms are noted. The list of terms thus prepared constitutes what is called an indexing language. One of the methods by which such a language is formed is to list or store the acceptable terms in a vocabulary. Such lists contain specific decisions relating to the preferred words, and also decisions regarding the form of words to be used, for example singular or plural nouns, or adjectives. A thesaurus is one such list for controlling the terminology used in subject catalogues and indexes. Hence the thesaurus also plays a role as a tool for vocabulary control in indexing.
In conclusion, the thesaurus is an indexing and searching aid for librarians and information scientists. It is true that considerable intellectual endeavour is involved in the construction of a thesaurus. Nevertheless, this searching tool can immensely enhance the performance of an on-line IRS. In the long run it repays its cost manifold by saving searching time, thereby making the on-line system cost effective. In this age we need a thesaurus for each and every subject area. Only when thesauri cover all subject fields can we be sure of satisfying the information needs of all and providing an effective on-line information retrieval system.