## Search Engines

### Overview

By the end of this page you will be able to:

• understand the difference between search engines, directories, and portals
• name examples of search engines and directories
• describe how search engines work
• refine searches to produce manageable results
• understand the meaning of various search terms

• The IT Crowd - Break the internet (2007)[1] has a light-hearted view on how to break the internet using Google.
• A website to direct people to if they ask something they should have searched for: http://www.giyf.com/

Advantages of searching via the internet include:

• Available 24 hours/ 7 days
• Research can be anonymous
• Can usually get further information via email
• More economical for both person searching and business. (cost of brochures, current, etc)

Finding information on the WWW can be difficult with hundreds of millions of pages. So, like a librarian or a catalogue system in a library, the internet has a variety of resources available to find information. As search sites have evolved, they now combine a variety of tools (such as search engines and crawlers) and may have several features to encourage you to use them (directories, portals (news), language translators, email).

### Resources

Here are some sites to help you research search engines. Whether you are planning to promote your business yourself or outsource, it is a good idea to be "in the know."

• Search Engine Forums ( http://www.searchengineforums.com )
• Online discussion forum with nothing but info on search engines. Newbies and veterans alike bring questions and expertise to the table. Terrific resource for anyone wanting to know more about the search engine game.
• SearchEngineWatch.com ( http://www.searchenginewatch.com )
• The definitive guide to all that is the world of search engines. Founded by Danny Sullivan, a leading expert on search technologies and a 1998 recipient of the Tenagra Award for Internet Marketing Excellence.
• AllSearchEngines ( http://www.allsearchengines.com )
• A little outdated on some of the facts, but good place to find major, pay for placement, topic specific, and other search engines and directories.

## History

## Types

### Global, local and personal

• Global search engine: Programs that run automatically, exploring websites and creating catalogues of web pages. Because they index so many web pages, search engines often find information not listed in directories. Some engines, such as AltaVista and Ask Jeeves, are introducing natural language queries and language translation facilities.
• Local search engine: Searches a specific site, e.g. Microsoft (http://www.microsoft.com).
• Personal: With the large hard drives available on personal computers, some search engines are migrating to the desktop.

Examples:

### Knowledge graphs

• Google's Knowledge Graph serves up facts and services in response to search terms, not just links to websites. Introduced in May 2012, it is a step in a process in which search engines are morphing into something quite new: vast brains that respond directly to questions posed in everyday language.
• Microsoft's knowledge graph, known as the Satori database, contains 350 million entities, according to Bing Search director Stefan Weitz. Microsoft's Snapshot service will use its knowledge graph to display links to services associated with the search item. Snapshot's aim is to guess the real-world action that a user is interested in when they search and to return links that enable them to carry out those actions.

### Visual Search engines

NY Times: Image searching on a globe

• Apple Siri.

### Meta crawlers

Send your search to several search engines simultaneously, then blend the results together onto one page. Examples:

### Directories

Pages of organised links compiled by humans; they are best if you know what you’re looking for. Sites are submitted, then assigned to an appropriate category or categories. Because of the human role, directories can often provide better results than search engines; however, they are usually out of date. Many are organised by region. Examples:

### Portals

A user’s first stop on the WWW, providing a search engine plus services such as news, weather, stock market reports, horoscopes, translation services etc. Often these services are provided as a result of strategic alliances, particularly with the large news agencies. Conversely, many companies provide portals and include a search engine. Examples:

These engines measure the popularity of the links in search engine results. The links that are followed more often than others rise higher in the rankings. This filters out irrelevant and unhelpful results. Examples:

### Regional and specialist engines

These engines only list websites in a certain geographical area or subject area and can be very useful if you don’t want to search the entire web. There are specialist search engines for law, medicine, web design etc. Examples:

### Government sites

Local and regional government agencies are making most of their public information available over the net. Examples:

### Company

Companies such as Microsoft and Hewlett-Packard, organisations like NASA, and information providers like Encyclopaedia Britannica also provide search engines for their own sites. So, for example, if you are looking for a solution to a Windows problem you can search the Microsoft site directly. Examples:

### Geographic Information Systems

Search databases based on maps. Used extensively by emergency services.

#### Google Earth/Maps, Local Live, World Wind

• Google Earth, introduced in 2005, is a searchable database of satellite maps. Companies can purchase an advertisement that appears as the earth is zoomed. (The MMO right is a 3D representation of EIT's IT building)
• Google Maps is a browser-based version, similar to Microsoft's.
• Microsoft (2006) Live Search. Retrieved November 29, 2006 from http://local.live.com
• Microsoft's mapping system. Live Search is a free online local search and mapping service that combines road and aerial maps worldwide plus unique bird’s eye imagery for select areas. You can also explore the world through maps in 3D view. Use Live Search to learn about, discover, and explore a specific location with the advanced driving directions, traffic information, local listings, and other local search tools.
• Kim, R (2006). NASA World Wind. Retrieved November 29, 2006 from http://worldwind.arc.nasa.gov/
• World Wind lets you zoom from satellite altitude into any place on Earth. Leveraging Landsat satellite imagery and Shuttle Radar Topography Mission data, World Wind lets you experience Earth terrain in visually rich 3D, just as if you were really there. (Version 1.3.5 (2006) is a 60 MB download; drag with the right mouse button to show 3D.)

Others:

• US Geological Survey (2007) Landsat Image Mosaic Of Antarctica (LIMA). Retrieved December 02, 2007 from http://lima.usgs.gov/access.php
• The Landsat Image Mosaic of Antarctica (LIMA) is seamless and virtually cloudless. LIMA is the most geometrically accurate (within a pixel—30 meters by 30 meters of land) mosaic of Antarctica and has the highest spatial resolution.

### 3D Search Engines

Princeton (2005) Princeton 3D Model Search Engine. Retrieved November 29, 2005 from http://shape.cs.princeton.edu/search.html

Allows you to search for 3D models based on a sketch (done using JavaScript)

### Alternative search engines

Sources include: Bass, S (2003, Aug) Maximum Google: Search Alternatives offer cool features. pp 87-92, NZ PC World, ISSN 0114 7285, IDG Communications Group

## How search engines work

### Three parts of a search engine

1. Automatic cataloguer

An automatic cataloguer, called a spider, crawler or robot, roams around the web following links and reading documents. Any linked document could potentially be visited, including those that the document’s owners would rather the world did not see. The spider returns to the site on a regular basis, such as every month or two, to look for changes.

2. Index

The spider results are then gathered together into a huge index, which contains a copy of every web page visited by the spider. Common words such as and, to, a and or are removed. If a web page changes then this copy is updated. Sometimes it can take a while for new pages or changes that the spider finds to be added to the index. A web page may have been "spidered" but not yet "indexed." Until it is indexed it is not available to people searching with the search engine. Some engines "expire" pages if they have not been updated within a set time period.

3. Search software

The third part is the search software which sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant. There are usually various filters and options that are applied to the search. Different search websites may be using the same index, for example the Inktomi index "powers" the HotBot, MSN and GoTo engines, though results may be slightly different in each since each uses the Inktomi index in a slightly different way.
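The three parts described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any real engine is built: the `WEB` dictionary stands in for the internet, and the URLs, the stop-word list and the function names `crawl`, `build_index` and `search` are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example/home": ("Welcome to the garden shop", ["a.example/roses", "a.example/tools"]),
    "a.example/roses": ("Roses and gardening tips for the garden", ["a.example/home"]),
    "a.example/tools": ("Gardening tools and a spade", []),
    "a.example/hidden": ("Nobody links to this page", []),
}

# Assumed stop-word list; real engines use their own.
STOP_WORDS = {"and", "to", "a", "or", "the", "for"}


def crawl(start_url):
    """Part 1, the spider: follow links from start_url, return URLs in visiting order."""
    visited, queue = [], [start_url]
    while queue:
        url = queue.pop(0)
        if url in visited or url not in WEB:
            continue
        visited.append(url)
        queue.extend(WEB[url][1])
    return visited


def build_index(urls):
    """Part 2, the index: map each non-stop word to {url: number of occurrences}."""
    index = defaultdict(dict)
    for url in urls:
        for word in re.findall(r"[a-z0-9]+", WEB[url][0].lower()):
            if word not in STOP_WORDS:
                index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """Part 3, the search software: pages containing every query word, best first."""
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not words:
        return []
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))

    def score(url):
        # Simple relevance guess: total occurrences of the query words.
        return sum(index[w][url] for w in words)

    return sorted(matches, key=lambda url: (-score(url), url))


pages = crawl("a.example/home")
index = build_index(pages)
print(pages)
print(search(index, "gardening tools"))
```

Note that `a.example/hidden` is never crawled because no page links to it, so it can never appear in the results.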

### Manual registration

It is also possible to register your web pages with search engines. This is particularly useful if your web site is new and is not linked to by other web pages, or if you want to speed up the registration process.

### Servers

Inktomi search engine server room.

Because of the huge indexes and billions of search requests each day, search engine hardware has to be incredibly powerful. Search engines are often a showcase for manufacturers. Part of the server room for the Inktomi index is shown.

### Search Engine Optimization 101

SEO Logic (2007) Search Engine Optimization 101. Retrieved September 3, 2007 from http://www.seologic.com/guide/

## Using search engines

### Refining searches

Searching will often return hundreds of thousands of mainly irrelevant results. To narrow down a search:

• Use + and AND. Without these the search engine will find pages containing any one of the words. For example, say you want to find information on Windows 2000 bugs. If you search for windows 2000 bugs you will also find documents that are just about windows, or bugs, or that contain the number 2000.

A search of +windows +2000 +bugs will find pages that contain all three words. However, this may find pages where the words are completely unrelated to each other. You might find a window cleaner's notes on bugs written in 2000.

• Define phrases using quotes. For example, search for "windows 2000" +bugs or, even better, "windows 2000 bugs", which should only find pages that contain that entire phrase.
• Most search engines also recognise proper names, so you can capitalise these. For example, use Windows to find information on the Microsoft operating system and windows to find information on the things you look through.
• Subtract subjects you are not interested in. For example, if "Windows 2000 bugs" finds lots of resources about Windows 98 or Windows 95 then try a search such as "Windows 2000" +bugs -"Windows 98" -"Windows 95".
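As a rough sketch of how the +, - and quote operators might be interpreted, the Python below splits a query into required, excluded and optional terms and tests a piece of text against them. The function names `parse_query` and `matches` are hypothetical, and the rule that unsigned terms only need one match is an assumption based on the description above; real engines differ.

```python
import re


def parse_query(query):
    """Split a query into (required, excluded, optional) lists of terms.

    Quoted text is kept together as one phrase. A leading + marks a term as
    required and a leading - marks it as excluded.
    """
    required, excluded, optional = [], [], []
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]*)"|(\S+))', query):
        term = (phrase or word).lower()
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            optional.append(term)
    return required, excluded, optional


def matches(text, query):
    """Return True if text satisfies the query (assumed matching rules)."""
    required, excluded, optional = parse_query(query)
    # Lower-case the text and strip punctuation so phrases compare cleanly.
    normalised = " ".join(re.findall(r"[a-z0-9]+", text.lower()))

    def contains(term):
        return re.search(r"\b" + re.escape(term) + r"\b", normalised) is not None

    if any(contains(term) for term in excluded):
        return False
    if required:
        return all(contains(term) for term in required)
    # With no + terms, any one of the words is enough.
    return any(contains(term) for term in optional)


bug_list = "Known bugs in Windows 2000 and how to fix them"
cleaner = "Cleaning windows: notes on bugs, written in 2000"
comparison = "Windows 98 and Windows 2000 bugs compared"

print(matches(cleaner, "+windows +2000 +bugs"))
print(matches(cleaner, '+"windows 2000" +bugs'))
print(matches(comparison, '+"windows 2000" +bugs -"windows 98"'))
```

The window cleaner's page satisfies the three separate + words but not the quoted phrase, and the comparison page is dropped once "windows 98" is subtracted.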

Many search engines also allow you to filter your results further. Often you can:

• Search by date, e.g. only those pages that are less than a year old
• Search only web page titles
• Find pages written in a specific language
• Find only pages containing specified media types, e.g. images, audio
• Find only files that end in a certain file extension, e.g. .gif .mp3
• Find pages that are hosted in a particular domain or geographical area
• Search by "depth" of page on a website, e.g. only home pages or only sub-pages
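Some of these filters are simple tests applied to each result. The sketch below assumes a hypothetical list of results, each with a URL, a title and a last-updated date, and shows filtering by file extension, by domain and by date; the field names and helper functions are illustrative only.

```python
import posixpath
from datetime import date
from urllib.parse import urlparse

# Hypothetical search results.
RESULTS = [
    {"url": "http://www.example.co.nz/music/song.mp3", "title": "Song", "updated": date(2007, 5, 1)},
    {"url": "http://www.example.com/index.html", "title": "Home page", "updated": date(2003, 1, 15)},
    {"url": "http://images.example.co.nz/logo.gif", "title": "Logo", "updated": date(2007, 8, 20)},
]


def by_extension(results, extension):
    """Keep results whose URL path ends in the given extension, e.g. '.mp3'."""
    return [r for r in results
            if posixpath.splitext(urlparse(r["url"]).path)[1].lower() == extension]


def by_domain(results, suffix):
    """Keep results hosted under a domain suffix, e.g. '.nz'."""
    return [r for r in results if urlparse(r["url"]).hostname.endswith(suffix)]


def updated_since(results, cutoff):
    """Keep results updated on or after the cutoff date."""
    return [r for r in results if r["updated"] >= cutoff]


print([r["title"] for r in by_extension(RESULTS, ".gif")])
print([r["title"] for r in by_domain(RESULTS, ".nz")])
print([r["title"] for r in updated_since(RESULTS, date(2007, 1, 1))])
```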

### Boolean searches

These are searches that allow the inclusion or exclusion of documents containing certain words through the use of operators such as AND, NOT and OR. Boolean search commands have been used by professionals for searching through traditional databases, library catalogues etc. for years. However, they are overkill for the average web search, as all the search engines support + and -. The OR command is used for specifying alternatives, e.g. New Zealand OR Aotearoa. Another Boolean command is NEAR, which will find words that are in close proximity to each other. The actual distance between the words depends on the individual search engine. It is possible to nest Boolean searches using brackets, e.g. impeachment AND (clinton OR johnson).
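Boolean operators map naturally onto set operations: AND is an intersection, OR is a union and NOT is a difference. The sketch below uses a small made-up document collection (`DOCS`) and hypothetical helpers `docs_with` and `near`; the word distance used for NEAR is an assumption, since each engine chooses its own.

```python
import re

# Hypothetical document collection: id -> text.
DOCS = {
    1: "The impeachment of president Clinton",
    2: "Andrew Johnson faced impeachment in 1868",
    3: "Clinton visits New Zealand",
    4: "Aotearoa travel guide",
    5: "Impeachment rules explained",
}


def tokens(text):
    """Lower-case words of a text, without punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def docs_with(word):
    """Set of ids of the documents that contain the word."""
    return {doc_id for doc_id, text in DOCS.items() if word.lower() in tokens(text)}


def near(word_a, word_b, max_gap):
    """Ids of documents where the two words occur within max_gap words of each other."""
    hits = set()
    for doc_id, text in DOCS.items():
        words = tokens(text)
        positions_a = [i for i, w in enumerate(words) if w == word_a]
        positions_b = [i for i, w in enumerate(words) if w == word_b]
        if any(abs(i - j) <= max_gap for i in positions_a for j in positions_b):
            hits.add(doc_id)
    return hits


# impeachment AND (clinton OR johnson)
nested = docs_with("impeachment") & (docs_with("clinton") | docs_with("johnson"))
# zealand OR aotearoa
either = docs_with("zealand") | docs_with("aotearoa")
# impeachment NOT clinton
without = docs_with("impeachment") - docs_with("clinton")

print(sorted(nested), sorted(either), sorted(without))
print(sorted(near("johnson", "impeachment", 2)))
```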

### Natural language queries

These allow you to ask a question using English, e.g. What is the weather like in Auckland? Best for when you are looking for the answer to something, rather than a specific web site, for example What is the population of Ceylon? The best service is Ask Jeeves: http://www.askjeeves.com/ , which is used by the AltaVista search engine.

Probably the most popular search engine that strives both to index many pages and to improve usability is Google. It began as an academic search engine. In a paper describing how it was built, Sergey Brin and Lawrence Page indicated that with four spiders the system could crawl more than 100 pages per second, generating about 600 KB of data per second. The Google spider looked at two things:

• Words within a page
• Where the words were found (meta tags, titles, headings)
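One way to picture "where the words were found" is to give each part of a page a different weight. The Python sketch below is only an illustration: the weights in `FIELD_WEIGHTS` and the `score` function are invented for this example and are not Google's actual method.

```python
# Hypothetical weights: a word in the title counts for more than one in the body.
FIELD_WEIGHTS = {"title": 5, "heading": 3, "meta": 2, "body": 1}


def score(page, word):
    """Weighted count of how often word appears in each part of the page."""
    word = word.lower()
    return sum(weight * page.get(field, "").lower().split().count(word)
               for field, weight in FIELD_WEIGHTS.items())


rose_page = {"title": "Rose gardening", "body": "Tips on gardening"}
shop_page = {"title": "Garden shop", "body": "gardening gardening tools"}

print(score(rose_page, "gardening"))
print(score(shop_page, "gardening"))
```

Under these assumed weights the first page ranks higher even though the second page repeats the word more often, because the word appears in its title.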

       ** GOOGLE QUERY SUBMISSION FORM **

__ Web  __ images __ Groups __News __Froogle

Mail to:
1600 Amphitheatre Parkway
Mountain View, CA 94043

Please allow four to six weeks for results. Thank you.

For definitions try define:

### Search Hints

If you are searching the internet for something specific, you may be able to improve the quality of the results by following the steps below.

1. Enter a URL

2. Use a search tool

3. Narrow the search with a Boolean operator

• gardening AND roses

If you want to find a definition of a word

## Terminology

### Search terms

• Concept search: A search for documents related conceptually to a word, rather than specifically containing the word itself.
• Full text index: An index containing every word of every document cataloged, including stop words (defined below)
• Fuzzy search: A search that will find matches even when the words are only partially spelled or misspelled.
• Keyword search: A search for documents containing one or more words specified by a user.
• Phrase search: A search for documents containing an exact sentence or phrase specified by a user.
• Precision: The degree to which the documents a search engine lists actually match a query. The more of the listed documents that match, the higher the precision. For example, if a search engine lists 80 documents found to match a query but only 20 of them contain the search words, then the precision would be 25%.
• Proximity search: A search where the users specify that documents returned should have the words near each other.
• Query by example: A search where a user instructs an engine to find more documents that are similar to a particular document. Also called find similar.
• Stemming: The ability for a search to include the stem of words. For example stemming allows a user to enter swimming and get back results for the stem word swim.
• Stop words: Conjunctions, prepositions, articles and other words, such as AND, TO and A, that appear in documents yet alone may contain little meaning.
• Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves do not appear in documents.
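A few of these terms can be illustrated with short Python functions. These are simplified sketches: `crude_stem` only strips "ing" and is far cruder than a real stemmer, the stop-word list is just the three words named above, and the fuzzy search uses the standard library's `difflib` with an assumed similarity cutoff of 0.8.

```python
import difflib

STOP_WORDS = {"and", "to", "a"}


def precision(matching_listed, total_listed):
    """Fraction of the listed documents that really match the query."""
    return matching_listed / total_listed


def remove_stop_words(words):
    """Drop stop words from a list of words."""
    return [w for w in words if w.lower() not in STOP_WORDS]


def crude_stem(word):
    """Very rough stemming: strip 'ing' and undo a doubled final letter."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word


def fuzzy_search(word, vocabulary):
    """Closest word in the vocabulary, tolerating small misspellings."""
    return difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)


print(precision(20, 80))
print(remove_stop_words("how to build a search engine".split()))
print(crude_stem("swimming"))
print(fuzzy_search("serch", ["search", "engine", "index"]))
```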

## Web spam: Modifying a web page to influence its ranking

Also known as:

• web spam (Gyöngyi and Garcia-Molina 2005; Joachims, Granka et al. 2005; Metaxas and DeStefano 2005),
• search engine spam (Wu and Davison 2005) and
• spamdexing (Gyöngyi and Garcia-Molina 2005).

## Emerging technologies

### Image searching

"Search Looks at the Big Picture", Wired News (01/06/05); Gartner, John

Computer scientists are working on visualization technologies that would allow image searches to be done without relying solely on text descriptions, as is currently done with Google, Yahoo!, and other search engines. Image search engines that can identify components of an image will be less vulnerable to manipulation by pornographers or deceitful advertisers that tag pictures as "Britney Spears" or some other popular content.

Xerox Research Center Europe and a group of European universities are jointly working on software that recognizes "key patches" in pictures, such as an ocean and beach, or car and tires. Since development began in 2002, the software has learned hundreds of objects, and it would be useful in distinguishing between dissimilar pictures that share a common keyword.

Improved image search would also open up new advertising opportunities, such as a function that allows online shoppers to search for a similar-looking red sweater, but at a lower price than one already found. Such capability would open up a vast new market as companies work to improve their image search rankings for popular products, people, or places.

The IBM Pervasive Media Management group is also developing visualization technology, but for video; the Marvel system also relies on learned images and categorizes them into concepts, such as travel or sports. IBM is currently working with CNN and ABC to classify news coverage concepts so Marvel technology can be applied to news archives. Search companies have not yet endorsed visualization technology, but Yahoo! is trying to bolster its video search with its Media RSS format and by working with Hollywood studios on metadata tagging. (Gartner, 2005, Jan 6)[6]

• Google Goggles - December 12, 2009. Image searching for Android (Google's operating system) phones.

## Related applications

• Google has found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate flu activity in US states up to two weeks faster than traditional systems. (Google, 2009b)[8]

## Search patterns, trends, and surprises

Google aggregates millions of search queries and provides several tools that give insight into global, regional, past and present search trends.

### Web Searching Basic

##### Assessed Activity 4.1 (2016) Web Searching [3 marks]
1. In your web site, create a new page called search_basic.html (and link this from the first page - index.html)
2. Create an ordered list in the following format (Note the answers are an indented unordered list).

E.g.

1. What do triskaidekaphobes have a fear of?
• Fear of ....
2. The phrase “In a Jiffy” is used often, how long is a Jiffy?
• A jiffy is ....

3. Answer 6 of the following questions

1. What do triskaidekaphobes have a fear of?
2. The phrase “In a Jiffy” is used often, how long is a Jiffy?
3. The highest temperature ever recorded in the world
4. Although identified with Scotland, bagpipes are a very ancient instrument, where did they come from
5. What is Donald Duck’s middle name and where does he live
6. What colour is the blood of lobsters
7. Which country was the first to use paper money
8. What was the first man-made invention to break the sound barrier
9. What European capital used to be called Lutetia?
10. What animal’s milk is used to make authentic Italian mozzarella cheese?
11. Which former All Black forward was the only one to feature in the top 10 of all time drop goal scorers in NZ rugby?
12. What is the only man made structure that can be seen from outer space
13. What is the largest animal made structure in the world
14. The Zloty is a unit of currency in which country
15. If you had tinnitus, from what would you be suffering?
16. Which planet is known as the Red Planet?
17. The first Bond movie was released in 1962. What was the title?
18. Which computer device has a "bezel?"
19. 'WYSIWYG' is a computer word. But what does it stand for?
20. Most people have heard of an IBM PC, what do the letters IBM stand for

4. Add four more questions and answers of your own. The question/answers should be a little unusual and on things that are not well known.

Marking

Marks will be awarded based on the completeness of your submission

##### Assessed Activity 4.2 (2016) Web Searching - Advanced [2 marks]
1. In your web site, create another page called search.html. Create a three column table that shows the "Search Performed", "Number of Matches" and "Advanced Searching".
2. Go to http://www.google.com and then click on the Advanced Search link located at the right. For the following always use the advanced search. In the "Number of Matches" column, write the number of results you get. Your goal is to get fewer than 500 results!

Queries

• Practice
1. WITH ALL: Internet Security
• This search is bad because it will find every web page that contains the words Internet and Security, even when they are unrelated to each other
• You will get different results if you capitalize Internet Security
2. EXACT PHRASE: Internet Security
• The exact phrase is the best way to search because it only finds web pages with the phrase “Internet Security”. Use quotes around a phrase in a normal search to achieve the same result.
3. WITH ALL: Internet Security; WITHOUT: books or movies
• This gets rid of all the web pages that are selling books or movies
4. EXACT PHRASE: Internet Security; WITHOUT: books or movies; OCCURRENCES: Choose in the Title
• This only finds web pages with Internet Security in the title of the page.
5. EXACT PHRASE: Internet Security; WITHOUT: books or movies; OCCURRENCES: Choose in the URL
• Notice that all the results have Internet Security in the web page address!
6. EXACT PHRASE: Internet Risks; WITHOUT: books or movies; OCCURRENCES: Choose in the Title
• Use words that best describe what you are trying to find! If you get fewer than 500 results, you have done an excellent search!
• Record Results
1. EXACT PHRASE: Internet Risks; WITHOUT: books or movies; OCCURRENCES: Choose in the Title
• Use words that best describe what you are trying to find! If you get fewer than 500 results, you have done an excellent search!
2. EXACT PHRASE: Internet Risks; WITHOUT: books or movies; OCCURRENCES: Choose in the URL
• This only finds web pages that have Internet Risks in the address

Marking

• Marks will be awarded based on the completeness of your submission

##### Assessed Activity 4.3 (2016) Web Searching - Google search [3 marks]
2. At the top of the page search.html, add a Google search form. (Note: it is best to put the form tag just after the body tag (at the top) and close it just before the /body tag, as you are only allowed one form element per page.)

Marking

• Marks will be awarded based on the completeness of your submission

## Some quirky things (not assessed)

