
Appendix 4: Glossary

Words in bold indicate that there are separate glossary entries on those topics. This glossary is taken from the second edition of Website Indexing, by Glenda Browne and Jonathan Jermey, available from Auslib Press.

Absolute addressing: the practice of including a complete URL in the anchor tag for a link to a webpage: for instance, <A HREF=http://www.aussi.org> is a link to an absolute address. Links to other websites are always absolute, but links within a website may be absolute or relative.
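
The difference between absolute and relative addresses can be sketched in Python with the standard urllib.parse module. This is a minimal illustration only: the base page address and the file name glossary.htm are hypothetical, and is_absolute is a helper invented for this example.

```python
from urllib.parse import urljoin, urlparse


def is_absolute(href: str) -> bool:
    """Treat a link as absolute when it names both a scheme and a host."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)


# Hypothetical page on which the links appear (an assumption for this sketch).
base = "http://www.aussi.org/resources/index.htm"

print(is_absolute("http://www.aussi.org"))  # True: complete URL
print(is_absolute("glossary.htm"))          # False: relative to the current page

# A browser resolves a relative link against the address of the current page.
print(urljoin(base, "glossary.htm"))
```

The last line shows why relative links only work within a site: they have no meaning without a base address to resolve against.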

Adobe Acrobat PDF, see PDF

Agent, see Bot

Anchor (Bookmark): an HTML anchor makes the location in the file at which it is inserted available as a target for a link. It is written in the format <A NAME=AnchorName>…</A>.

Automated categorisation: the use of computer software to categorise webpages. It can be done using rule-based methods, in which the system is gradually trained, or by fully automated methods. Taxonomies for categorisation can also be created automatically.

Back-of-book-style indexing: creation of a website index that looks and functions like a back-of-book index. It will usually be alphabetically organised, give detailed access to information, and contain index entries with subdivisions and cross-references.

Bookmark, see Anchor

Boolean ‘and’: Use of the Boolean operator ‘and’ in a query means that all of the terms in the query must be present in a document for it to be retrieved. For example, ‘automated and categorisation’ means that a document must contain the term ‘automated’ and the term ‘categorisation’.

Boolean ‘not’: Use of the Boolean operator ‘not’ in a query means that if the term following ‘not’ is present in a document, that document will not be retrieved. For example, ‘bear not market’ will not retrieve a document containing the sentence ‘Share prices have gone down in the bear market’.

Boolean ‘or’: Use of the Boolean operator ‘or’ in a query means that at least one of the terms in the query must be present in a document for it to be retrieved. For example, ‘categorisation or categorization’ means that a document must contain either the term ‘categorisation’ or the term ‘categorization’.
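
The three Boolean operators can be sketched in Python as operations on the set of words in each document. This is an illustrative sketch, not a description of how any particular search engine works; the four sample documents and the function names are invented for the example.

```python
# Hypothetical sample documents, keyed by document number.
docs = {
    1: "automated categorisation of webpages",
    2: "share prices have gone down in the bear market",
    3: "the bear is a large mammal",
    4: "categorization schemes vary",
}

# Pre-compute the set of lower-case words found in each document.
words = {number: set(text.lower().split()) for number, text in docs.items()}


def search_and(a, b):
    """Both terms must be present."""
    return sorted(n for n, w in words.items() if a in w and b in w)


def search_or(a, b):
    """At least one of the terms must be present."""
    return sorted(n for n, w in words.items() if a in w or b in w)


def search_not(a, b):
    """The first term must be present and the second absent."""
    return sorted(n for n, w in words.items() if a in w and b not in w)


print(search_and("automated", "categorisation"))      # [1]
print(search_or("categorisation", "categorization"))  # [1, 4]
print(search_not("bear", "market"))                   # [3]
```

Document 2 mentions both ‘bear’ and ‘market’, so the ‘not’ query excludes it, as in the share price example above.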

Bot (Agent, Robot): a program with some artificial intelligence that is sent to do a task in lieu of a real person. Bots run automatically and act autonomously. Spiders are one example.

Breadcrumb: a trail of links to all levels of the hierarchy above the current location, showing the route a searcher has taken and the context of the current page. Breadcrumbs allow users to backtrack and to move up the hierarchy.
For example, Rhinitis>Allergic rhinitis>Perennial allergic rhinitis (Hayfever).

‘Breadcrumbs’ is based on the story of Hansel and Gretel, who dropped bits of bread to make a trail to help them find their way out of the forest. (Not that it helped them, as the birds ate the crumbs!)
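
A breadcrumb trail can be generated by walking up the hierarchy from the current page to the top level. The Python sketch below assumes the site records each page's parent in a simple lookup table; the table and the function name are hypothetical, and the page names come from the rhinitis example above.

```python
# Hypothetical parent lookup: each page maps to the page one level above it.
parents = {
    "Rhinitis": None,
    "Allergic rhinitis": "Rhinitis",
    "Perennial allergic rhinitis": "Allergic rhinitis",
}


def breadcrumb(page, separator=">"):
    """Collect the page and all its ancestors, then list them top-down."""
    trail = []
    while page is not None:
        trail.append(page)
        page = parents[page]
    return separator.join(reversed(trail))


print(breadcrumb("Perennial allergic rhinitis"))
# Rhinitis>Allergic rhinitis>Perennial allergic rhinitis
```

On a real site each name in the trail would be a link, so that the user can move up to any level.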

Breadth: the number of navigation options available at each stage. A home page that provides links to 20 subsections has more breadth than one that says ‘Click here to select a department’.

Cascading style sheet, see Style sheet

Categorisation: the use of hierarchies based on words rather than notations. Each topic is allocated to a group, and that group is allocated to a more general group, and so on. Searching typically involves moving from more general to more specific topics; for example, to search for information on children’s birthday parties you might first select the option Celebrations, then Birthdays, then Children’s parties. Category structures are fairly arbitrary and may vary widely from one site to another; on a different site you might, for example, select Catering, then Parties, then Children’s parties, then Children’s birthday parties. By using techniques such as double posting and cross-referencing, categorised sites can provide for access from several different directions. See also automated categorisation.

Chunk: smallest unit of content that is used independently and needs to be indexed individually.

Classification: formal established classification schemes – for example, the Dewey Decimal Classification (DC) and Library of Congress Classification (LC) – that use a notation to describe classes of information.

Collaborative filtering: personalisation technology that uses recommendation engines to extract trends from the behaviour of website visitors and use that information to present suggestions to searchers. Amazon.com uses collaborative filtering to recommend books on the basis of purchases by other people with apparently similar interests.

Concordance: a (usually alphabetical) list of words from a book or website indicating the locations at which they occur. A concordance differs from an index in that no attempt to filter the source material or sensibly collate the information has been made.
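
A concordance can be produced mechanically, as the following Python sketch suggests. The two sample pages are invented for the example. Note that every word is listed, including ‘and’ and ‘for’, because nothing is filtered or collated.

```python
import re
from collections import defaultdict


def build_concordance(pages):
    """Map every word to the list of locations at which it occurs."""
    concordance = defaultdict(list)
    for location, text in pages.items():
        # Each distinct word on a page is recorded once for that page.
        for word in set(re.findall(r"[a-z]+", text.lower())):
            concordance[word].append(location)
    # Return the words in alphabetical order.
    return dict(sorted(concordance.items()))


# Hypothetical page texts, keyed by page number.
pages = {1: "Indexing rules for names", 2: "Names and keyword searching"}

for word, locations in build_concordance(pages).items():
    print(word, locations)
```

An index, by contrast, would omit words such as ‘and’ and would bring related terms together under chosen headings.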

Content management system (CMS): system for the creation, modification, archiving and removal of information resources from an organised repository. Includes tools for publishing, format management, revision control, indexing, search and retrieval.

Controlled vocabulary: a list of terms to be used in indexing (or cataloguing), often a thesaurus or synonym ring. Use of the same list by all indexers enhances consistency. Most libraries use the Library of Congress Subject Headings as a controlled vocabulary for cataloguing books and other library items.

Cost-per-click listing (CPC), see Pay-per-click listing (PPC)

Crawler, see Spider

Cross-reference: a See reference or See also reference leading the user from one part of the index to another.

CSS, see Style sheet

Database: a collection of records about individuals. Each record is made up of a number of fields relating to different characteristics of the individual. Many websites and web indexes are generated from databases.

Depth of hierarchy: the number of levels in the navigation hierarchy to the most specific topics. A site where you can select ‘amphibians’ then ‘frogs’ is shallower than one where you have to select ‘animals’, ‘vertebrates’, ‘amphibians’ and then ‘frogs’.
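
Depth can be measured by counting levels, as in this Python sketch. It assumes, purely for illustration, that each navigation hierarchy is stored as nested dictionaries in which an empty dictionary marks the most specific topic.

```python
def depth(tree):
    """Number of levels from the top of the hierarchy to its deepest topic."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())


# The two sites from the example above, as nested dictionaries.
shallow_site = {"amphibians": {"frogs": {}}}
deep_site = {"animals": {"vertebrates": {"amphibians": {"frogs": {}}}}}

print(depth(shallow_site))  # 2
print(depth(deep_site))     # 4
```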

Depth of indexing: the number of entries and their specificity. A deep index will give direct access to all the topics that have been dealt with in the text. A shallow index will cover major and general topics, but will not index minor topics.

Dialog box: a box into which users of a computer application can enter information.

Directory: a collection of evaluated links to websites, usually categorised by subject. Many search engines, such as Yahoo and Google, have associated directories. When directories are limited to information on a specific subject or discipline they are often called subject gateways.

Distributed authoring: content creation by people distributed throughout an organisation, not by a centralised group of web specialists or writers. With distributed authoring there is often an expectation that subject metadata will also be created by authors. This is distributed indexing.

Document: Any item (not necessarily on paper) that can be indexed or catalogued.

DTD (Document Type Definition): schema specification method for XML documents. A DTD is a collection of XML markup declarations that define the structure, elements and attributes that can be used in a document that complies with that DTD. By consulting the DTD a parser can work with the tags from the markup language that document uses. DocBook is an example of a DTD often used with technical documentation to enable sharing and reuse.

Ebook/Electronic book: standalone document intended for on-screen reading on a PC or a handheld device, either a dedicated ‘reader’ or a general purpose Personal Digital Assistant (PDA).

Editorial results: search engine hits dependent on content and not influenced by payment.

Embedded indexing: indexing method in which tagged index entries are inserted into document files. Tags are used to bracket blocks of text and to show headings and subdivisions for index entries. Tagged index entries are not seen in the printed version, but can be compiled by software to make an index. If parts of the document are removed or rearranged the tagged index terms go with them. The index can then be recompiled to give an updated version. Embedded indexing is more time-consuming than normal indexing, but is efficient for documents that change often, or are not complete when indexing starts.
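
The recompiling step can be sketched in Python as follows. The tag syntax {{index:heading}} is hypothetical and was chosen only for this illustration; real word processing and publishing software each use their own tag formats. Paragraph numbers stand in for page numbers.

```python
import re
from collections import defaultdict

# Hypothetical tag format for embedded index entries (an assumption).
TAG = re.compile(r"\{\{index:([^}]+)\}\}")


def compile_index(paragraphs):
    """Gather tagged headings with the paragraph numbers where they occur."""
    index = defaultdict(list)
    for number, text in enumerate(paragraphs, start=1):
        for heading in TAG.findall(text):
            index[heading].append(number)
    return dict(sorted(index.items()))


def visible_text(text):
    """The text as a reader sees it, with the index tags removed."""
    return TAG.sub("", text)


draft = [
    "{{index:metadata}}Metadata describes resources.",
    "{{index:thesauri}}A thesaurus controls vocabulary.",
]

print(compile_index(draft))                  # {'metadata': [1], 'thesauri': [2]}
# Rearranging the document moves the tags with the text.
print(compile_index(list(reversed(draft))))  # {'metadata': [2], 'thesauri': [1]}
print(visible_text(draft[0]))                # Metadata describes resources.
```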

Facet: grouping of concepts of the same inherent type, for example, processes, disciplines, people, materials, places, and times.

Faceted metadata classification: breaking subjects into standard component parts (facets) and presenting these to users as search options. A topic such as wine might be divided into facets such as country of origin, variety and price. In the best faceted search systems the user is provided with feedback about the number of terms retrieved at each stage.
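
The wine example can be sketched in Python as below. The three wine records, the facet values and the function names are all invented for the illustration. The counts give the kind of feedback that a faceted search system can show at each stage.

```python
from collections import Counter

# Hypothetical records, each described by three facets.
wines = [
    {"country": "Australia", "variety": "Shiraz", "price": "under $20"},
    {"country": "Australia", "variety": "Riesling", "price": "under $20"},
    {"country": "France", "variety": "Shiraz", "price": "$20 and over"},
]


def facet_counts(items, facet):
    """How many items fall under each value of the given facet."""
    return Counter(item[facet] for item in items)


def refine(items, facet, value):
    """Keep only the items that match the chosen facet value."""
    return [item for item in items if item[facet] == value]


print(facet_counts(wines, "country"))
australian = refine(wines, "country", "Australia")
print(facet_counts(australian, "variety"))
```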

False drop: document that is retrieved by a search but is not relevant to the searcher’s needs. False drops occur because of words that are written the same but have different meanings (for example, ‘squash’ can refer to a game, a vegetable or an action). 

Field searching: ability to limit a search by requiring that the search term is present in a specific ‘field’ (category of data) in the record. Field searching is often done with categories such as author and date that are common to most records.
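
The effect of limiting a search to one field can be sketched in Python. The two records and the function names are hypothetical; the point is that the term ‘green’ occurs in both records but in the author field of only one.

```python
# Hypothetical records with author, title and date fields.
records = [
    {"author": "Green", "title": "Indexing for beginners", "date": "2001"},
    {"author": "Smith", "title": "Green gardens", "date": "1999"},
]


def field_search(records, field, term):
    """Retrieve records in which the term occurs in the named field."""
    term = term.lower()
    return [r for r in records if term in r[field].lower()]


def anywhere_search(records, term):
    """Retrieve records in which the term occurs in any field."""
    term = term.lower()
    return [r for r in records if any(term in v.lower() for v in r.values())]


print(len(anywhere_search(records, "green")))        # 2
print(len(field_search(records, "author", "green")))  # 1
```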

Filing order: rules used for ordering (sorting) index entries. When a computer performs the sequencing it is often called sort order. [i]

Gateway, see Subject gateway

Global navigation: generally applicable navigational links (for example, Search; Site Map) available from all pages of a website.

Granularity: level of detail at which information is viewed or described. The more granular an access tool, the smaller the chunks of information it leads to. An index linking to specific paragraphs is more granular than a table of contents or site map linking to specific pages.

<HEAD> section: The <HEAD> section of an HTML document is placed at the top of the page between an opening tag, <HEAD>, and a closing tag, </HEAD>, and contains metadata about the document itself, not the content that will be displayed on the page. It is followed by the <BODY> section.

Hierarchy: a series of ordered groupings moving from broader general categories to narrow specific ones. In a web directory you may only see one level of the hierarchy at a time. When you select a topic you are then shown the options at the next level. See also Taxonomy; Thesaurus.

Hit highlighting: highlighting of the words in a results list which resulted in a document being retrieved by the search.
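
A minimal Python sketch of hit highlighting follows. It marks each query term with asterisks, which is an arbitrary choice for this example; a web search system would more likely use bold or coloured text. The function name is hypothetical.

```python
import re


def highlight(text, query_terms, marker="**"):
    """Wrap each whole-word occurrence of a query term in the marker."""
    if not query_terms:
        return text
    alternatives = "|".join(re.escape(term) for term in query_terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: marker + m.group(0) + marker, text)


print(highlight("Automated categorisation of webpages",
                ["automated", "categorisation"]))
# **Automated** **categorisation** of webpages
```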

HTML: hypertext markup language. The majority of webpages are made up of ordinary text ‘marked up’ with instructions in HTML which determine how the text is displayed by the user’s browser: for instance, the HTML code ‘Huey, <B>Dewey</B> and <I>Louie</I>’ appears in a browser as ‘Huey, Dewey and Louie’. HTML is also used to display graphics and define links to other sites and locations. See also XML.

Hypertext link, see Link

Indented style index: indented indexes start each subdivision on a new line, indented under the main heading. For example:

            names
                  indexing rules for  41-42
                  keyword searching and  5

Index entry: record in an index, consisting of a main heading and any associated locators, subheadings, and cross-references. This means the whole ‘metadata’ example below is one entry. When indexers charge by the entry they usually define each cross-reference or locator as an entry, meaning the ‘metadata’ sample below would contain six entries, made up of one cross-reference and five locators.

            metadata, see also thesauri
                  Dublin Core  15, 33-37
                  misspellings useful in  14
                  website structure derived from  99-101, 105
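
The count of six can be checked with a short Python sketch. The data structure and function names are assumptions made for this illustration. A page range such as 33-37 is counted as a single locator, which is the counting that gives five locators for the ‘metadata’ sample above.

```python
# The 'metadata' sample, held in a hypothetical structure.
entry = {
    "heading": "metadata",
    "cross_references": ["thesauri"],
    "subheadings": {
        "Dublin Core": "15, 33-37",
        "misspellings useful in": "14",
        "website structure derived from": "99-101, 105",
    },
}


def count_locators(locator_text):
    """Count comma-separated locators; a page range counts once."""
    return len([part for part in locator_text.split(",") if part.strip()])


def chargeable_entries(entry):
    """One for each cross-reference plus one for each locator."""
    locators = sum(count_locators(t) for t in entry["subheadings"].values())
    return len(entry["cross_references"]) + locators


print(chargeable_entries(entry))  # 6
```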

Indexing: often used to refer to the automatic selection and compilation of ‘meaningful’ words from a website into a list that can be used by a search system to retrieve pages. This list is more properly called a concordance. As this procedure involves no intellectual effort, indexers distinguish their own work by calling it intellectual indexing, manual indexing, human indexing, or back-of-book-style indexing.

Information architecture: design of the structure of information systems, particularly websites and intranets, including labelling and navigation schemes.

Information foraging: seeking information according to its adaptive value. Information foraging theory analyses trade-offs in the value of information gained against the costs of searching based on the analogy of ‘foraging for wild food’.

Information scent: visual and linguistic cues that indicate to a searcher whether a website has the information they seek, and help the searcher navigate to the required information. Information scent is a component of information foraging.

Instantiation: the electronic or physical manifestation of a resource.

Internet: a global electronic communications system allowing public access to email, newsgroups, chat and the web.

Intranet: a local network with restricted access that uses some or all of the same systems and software as the Internet.

Keyword: a) In the search engine section keywords are words that are used to search for a topic. Also called ‘search terms’. b) In the metadata section, keywords are subject metadata terms. [ii]

Keyword searching: typing significant words and phrases that relate to a topic into a search engine. For example, to find information about your pet, Gerby, you might type the keywords gerbils and sand rats. If you wanted scientific information you could try the scientific terms Gerbillus, Tatera, Taterillus gracilis and so on. If you needed to find more general information you could broaden your search with the terms domestic animals and pets.

Legacy data: data stored on older computer systems or in older formats that remains behind as the legacy of outdated technologies. It can be difficult to integrate into newer systems.

Link: a block of text or a graphic appearing on a webpage, which a user can click with the mouse pointer to cause an event to happen. This usually involves being taken to another webpage or another part of the same page.

Live file: the copy of an electronic document that is currently being worked on, for example, by a writer or indexer. All changes must be made to the live file. If an indexer worked on one copy of a document, and an editor on another, the changes made by one of them would have to be incorporated into the document worked on by the other. (In a different context, ‘live’ means that the file has been loaded onto the web and made available to users.)

Local link, see Relative addressing

Local navigation: links that are specific to a section of a website, compared with global navigation which is available from all parts of a site.

Locator: the part of an index entry that tells the user where to look for information. In a book index locators are usually page numbers (but can also be references to items, paragraphs and so on). In a web index they are direct links to the information. The links can be the heading or subdivisions of the index entry.

Main heading: heading at the beginning of an index entry, either used alone or modified by subheadings. The main heading is an entry point into the index. (Cross-references are the other entry points.)

Markup language: a way of depicting the logical structure or semantics of a document and providing instructions to computers on how to handle or display the contents of the file. HTML, XML and RDF are markup languages. Markup indicators are often called tags.

Metadata: structured data about data, which may include information about the author, title and subject of web resources. Metadata is added in the <HEAD> section of the webpage or is stored in a database. It is available for searching but is not displayed on the page.
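
Metadata held in the <HEAD> section can be read by software even though it is not displayed. The Python sketch below uses the standard html.parser module; the sample page, its META values and the class name MetaCollector are invented for the illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from META tags inside the HEAD section."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lower case.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.metadata[attributes["name"]] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


# Hypothetical webpage used only for this example.
page = """<HTML><HEAD><TITLE>Glossary</TITLE>
<META NAME="keywords" CONTENT="indexing, glossary">
<META NAME="author" CONTENT="Example Author">
</HEAD><BODY><P>Visible text.</P></BODY></HTML>"""

collector = MetaCollector()
collector.feed(page)
print(collector.metadata)
```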

Multi-purposing, see Single sourcing

Namespace: a closed set of names or a place where a schema (set of names) is stored. Namespaces are identified via a URI (for example, a URL) and are a mechanism to resolve naming conflicts. Within a gi