VirtualMV/Internet and Web/Search Engines/Content

Overview

 * The IT Crowd - Break the internet (2007) takes a light-hearted look at how to break the internet using Google.
 * A web site to direct people to if they ask something they should have searched for ... http://www.giyf.com/

Internet searching advantages
Finding information on the WWW can be difficult with hundreds of millions of pages. So, like a librarian or a catalogue system in a library, the internet has a variety of resources available to find information. As search sites have evolved, they now combine a variety of tools (such as search engines and crawlers) and may have several features to encourage you to use them (directories, portals (news), language translators, email). Advantages of searching via the internet include:
 * Available 24 hours a day, 7 days a week
 * Research can be anonymous
 * Can usually get further information via email
 * More economical for both the person searching and the business (cost of brochures, keeping them current, etc.)

Resources

 * Barker, J. (1999) Finding information on the Internet, University of California, Berkeley. Retrieved April 06, 2000, from http://www.lib.berkeley.edu/TeachingLibGuides/Internet/FindInfo.html
 * Search Engine Watch ( http://www.searchenginewatch.com/)
 * How stuff works ( http://www.howstuffworks.com)
 * About ( http://about.com/ )
 * KartOO visual meta search engine. A very different visual interface; try this one against some of the others and see how the results compare. Unfortunately it closed down in 2010 (Kartoo, 2010).
 * Mooter web search engine. Here's another one that takes a different approach and produces results in 'clusters': http://www.mooter.com/
 * Search engine tutorial. Useful if you are still getting 1.5 million hits. http://www.pandia.com/goalgetter/
 * BareBones 101. Another internet search tutorial which should help you hone your searching skills. http://www.sc.edu/beaufort/library/pages/bones/bones.shtml
 * Search Engine showdown. It's called the users guide to Web searching. It has some very useful hints, tips, resources and comments http://searchengineshowdown.com/

Here are some sites to help you research search engines. Whether you are planning to promote your business yourself or outsource, it is a good idea to be "in the know."
 * Search Engine Forums ( http://www.searchengineforums.com ): Online discussion forum with nothing but info on search engines. Newbies and veterans alike bring questions and expertise to the table. Terrific resource for anyone wanting to know more about the search engine game.
 * SearchEngineWatch.com ( http://www.searchenginewatch.com ): The definitive guide to the world of search engines. Founded by Danny Sullivan, a leading expert on search technologies and a 1998 recipient of the Tenagra Award for Internet Marketing Excellence.
 * AllSearchEngines ( http://www.allsearchengines.com ): A little outdated on some of the facts, but a good place to find major, pay-for-placement, topic-specific, and other search engines and directories.

History
(Backup: File:2010PPCBlog infographic-internet-search-engines-history.jpg)

Global, local and personal
 * Global search engine: Programs that run automatically, exploring websites and creating catalogues of web pages. Because they index so many web pages, search engines often find information not listed in directories. Some engines, such as AltaVista and Ask Jeeves, are introducing natural language queries and language translation facilities.
 * Local search engine: Searches a specific site, e.g. Microsoft (http://www.microsoft.com).
 * Personal: With the large hard drives now available on personal computers, some search engines are migrating to the desktop.

Examples:
 * Google: http://www.google.com/
 * AltaVista: http://www.altavista.com/
 * Ask Jeeves: http://www.askjeeves.com/
 * HotBot: http://www.hotbot.com/
 * Lycos: http://www.lycos.com/
 * DuckDuckGo: http://duckduckgo.com/

Knowledge graphs

 * Google Knowledge Graph serves up facts and services in response to search terms, not just links to websites. Introduced in May 2012, it is a step in a process in which search engines are morphing into something quite new: vast brains that respond directly to questions posed in everyday language.
 * Microsoft's knowledge graph, known as the Satori database, contains 350 million entities, according to Bing Search director Stefan Weitz. Microsoft's Snapshot service will use its knowledge graph to display links to services associated with the search item. Snapshot's aim is to guess the real-world action that a user is interested in when they search and to return links that enable them to carry out those actions.

Visual Search engines

 * A Moment in time (New York Times, 2010)
 * Search Cube

Audio search engines

 * Apple Siri.

Computational Search Engines

 * Wolfram Alpha
 * Computing a theory of everything (Wolfram, 2010): A TED talk in which Stephen Wolfram, creator of Mathematica, talks about his quest to make all knowledge computational, able to be searched, processed and manipulated.

Meta crawlers
Send your search to several search engines simultaneously, then blend the results together onto one page. Examples:
 * Dogpile: http://www.dogpile.com/
 * MetaCrawler (Go2Net): http://www.metacrawler.com/
 * Metafind: http://www.metafind.com/
 * Webcrawler: http://www.webcrawler.com
 * http://www.zapmeta.com/
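
The blending step described above can be sketched in a few lines. This is only an illustration, not how any of the listed meta crawlers actually works: the function name blend_results and the example.com result lists are made up, and the sketch assumes each engine has already returned its results as a ranked list of URLs. It takes the top hit from every engine, then the second hit, and so on, dropping duplicates.

```python
from itertools import zip_longest


def blend_results(*ranked_lists):
    """Interleave several ranked URL lists, keeping the first occurrence of each URL."""
    blended = []
    seen = set()
    # Take the top hit from every engine, then the second hit, and so on.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                blended.append(url)
    return blended


# Hypothetical result lists standing in for the responses of three engines.
engine_a = ["http://example.com/roses", "http://example.com/tulips"]
engine_b = ["http://example.com/tulips", "http://example.com/lilies"]
engine_c = ["http://example.com/roses"]

print(blend_results(engine_a, engine_b, engine_c))
```

Round-robin interleaving is just one simple way to blend; a real service might instead weight each engine or score pages that several engines agree on more highly.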

Directories
Directories are pages of organised links compiled by humans, and are best if you know what you're looking for. Sites are submitted, then assigned to an appropriate category or categories. Because of the human role, directories can often provide better results than search engines; however, they are usually out of date. Many are organised by region. Examples:
 * Accessnz: http://www.accessnz.co.nz
 * LookSmart: http://www.looksmart.com/
 * NZ pages: http://www.nzpages.co.nz
 * NZ Yellowpages: http://www.yellowpages.co.nz
 * Snap: http://www.snap.com/
 * Spectel online: http://www.spectel.co.nz (online library of architectural, building and design products)
 * Utopia: http://www.utopia.co.nz
 * Yahoo: http://www.yahoo.com/

Portals
A user's first stop on the WWW, providing a search engine plus services such as news, weather, stock market reports, horoscopes, translation services, etc. Often these services are provided as a result of strategic alliances, particularly with the large news agencies. Conversely, many companies provide portals and include a search engine. Examples:
 * Anzwers ( http://www.anzwers.com.au/ )
 * Excite ( http://www.excite.com/ )
 * HotBot ( http://www.hotbot.com/ )
 * Microsoft ( http://www.microsoft.co.nz ) and ( http://www.msn.com )
 * NZ City ( http://www.nzcity.co.nz )
 * Xtra ( http://www.xtra.co.nz )

Link popularity engines
These engines measure the popularity of the links in search engine results. The links that get followed more often than others then rise higher in the rankings. This filters out irrelevant and unhelpful results. Examples:
 * Google ( http://www.google.com/ )
 * Direct Hit (best seen on HotBot)
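
As a rough sketch of the idea described above (links that are followed more often rise in the rankings), the snippet below re-orders a result list by click counts. The function name rerank_by_clicks and the click figures are hypothetical, and this is not a description of how Google or Direct Hit actually compute their rankings.

```python
def rerank_by_clicks(results, click_counts):
    """Order results by how often each link was followed, most-clicked first."""
    # sorted() is stable, so links with equal click counts keep their original order.
    return sorted(results, key=lambda url: click_counts.get(url, 0), reverse=True)


# Hypothetical result list and click log.
results = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
clicks = {"http://example.com/b": 40, "http://example.com/c": 3}

print(rerank_by_clicks(results, clicks))
```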

Regional and specialist engines
These engines only list websites in a certain geographical area or subject area and can be very useful if you don’t want to search the entire web. There are specialist search engines for law, medicine, web design etc. Examples:
 * Anzwers ( http://www.anzwers.co.nz ) has Australian bias
 * Connectus ( http://www.connectus.co.nz ) building trade oriented
 * NZexplorer ( http://www.nzexplorer.co.nz )
 * Searchnz ( http://www.searchnz.co.nz ) uses fuzzy logic

Specialist academic
http://scholar.google.com/ A collection of "scholarly" works, suitable for academic research.

Government sites
Local and regional government agencies are making most of their public information available over the net. Examples:


 * NZ Government ( http://www.govt.nz/ )
 * Statistics NZ ( http://www.stats.govt.nz )
 * TradeNZ ( http://www.tradenz.govt.nz ) for importers and exporters

Company
Companies such as Microsoft and Hewlett-Packard, organisations like NASA, and information providers like Encyclopaedia Britannica also provide search engines for their own sites. So, for example, if you are looking for a solution to a Windows problem you can search the Microsoft site directly. Examples:


 * Encyclopaedia Britannica ( http://www.britannica.com/ )
 * Hewlett-Packard ( http://www.hp.com )
 * Microsoft ( http://www.microsoft.com )

Geographic Information Systems
Search databases based on maps. Used extensively by emergency services.

Google Earth/Maps, Local Live, World Wind
Browser-based version, similar to Microsoft's.
 * Google(2006) Google Earth – Home. Retrieved November 29, 2006 from http://www.earth.google.com
 * Introduced in 2005, Google Earth is a searchable database of satellite maps. Companies can purchase an advertisement that appears as the earth is zoomed.
 * Google (2007) Maps. Retrieved September 20, 2007 from  http://maps.google.com/
 * Top 10 Things You Didn't Know Google Maps Could Do (Purdy,2010)


 * Microsoft (2006) Live Search. Retrieved November 29, 2006 from http://local.live.com
 * Microsoft's mapping system. Live Search is a free online local search and mapping service that combines road and aerial maps worldwide plus unique bird's eye imagery for select areas. You can also explore the world through maps in 3D view. Use Live Search to learn about, discover, and explore a specific location with the advanced driving directions, traffic information, local listings, and other local search tools.

Others:
 * Kim, R (2006). NASA World Wind. Retrieved November 29, 2006 from http://worldwind.arc.nasa.gov/
 * World Wind lets you zoom from satellite altitude into any place on Earth. Leveraging Landsat satellite imagery and Shuttle Radar Topography Mission data, World Wind lets you experience Earth terrain in visually rich 3D, just as if you were really there. (Version 1.3.5 (2006) is a 60MB download; drag with the right mouse button to show 3D.)
 * US Geological Survey (2007) Landsat Image Mosaic Of Antarctica (LIMA). Retrieved December 02, 2007 from  http://lima.usgs.gov/access.php
 * The Landsat Image Mosaic of Antarctica (LIMA) is seamless and virtually cloudless. LIMA is the most geometrically accurate (within a pixel—30 meters by 30 meters of land) mosaic of Antarctica and has the highest spatial resolution.

3D Search Engines
Princeton (2005) Princeton 3D Model Search Engine. Retrieved November 29, 2005 from http://shape.cs.princeton.edu/search.html

Allows you to search for 3D models based on a sketch (done using JavaScript)

Alternative search engines

 * Alltheweb ( http://www.alltheweb.com )
 * Blekko (http://blekko.com/) Released November, 2010. Human involvement.
 * Gooru ( http://www.goorulearning.org/gooru/index.g#!/home) A Free Search Engine for Learning
 * Ilor ( http://www.ilor.com/searchilor.lor )
 * Profusion (http://www.profusion.com )
 * Teoma ( http://www.teoma.com )
 * Vivisimo ( http://www.vivisimo.com )

Sources include: Bass, S (2003, Aug) Maximum Google: Search Alternatives offer cool features. pp 87-92, NZ PC World, ISSN 0114 7285, IDG Communications Group

Three parts of a search engine
1. Automatic cataloguer

An automatic cataloguer, called a spider, crawler or robot, roams around the web following links and reading documents. Any linked document could potentially be visited, including those that the document's owners would rather the world did not see. The spider returns to the site on a regular basis, such as every month or two, to look for changes.
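
A minimal sketch of the spider's link-following loop is shown here. It assumes a hypothetical in-memory "web" (TOY_WEB) instead of real network requests, and the names crawl and fetch are illustrative only. Note that the "private" page is visited simply because another page links to it.

```python
from collections import deque

# Hypothetical three-page "web": each URL maps to (page text, outgoing links).
TOY_WEB = {
    "http://example.com/": (
        "Welcome to the garden centre",
        ["http://example.com/roses", "http://example.com/private"],
    ),
    "http://example.com/roses": ("Roses need sun", ["http://example.com/"]),
    "http://example.com/private": ("Draft price list", []),
}


def crawl(start_url, fetch):
    """Follow links breadth-first from start_url and return {url: text} for every page reached."""
    visited = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        page = fetch(url)
        if page is None:  # broken link: nothing to read
            continue
        text, links = page
        visited[url] = text
        queue.extend(link for link in links if link not in visited)
    return visited


pages = crawl("http://example.com/", TOY_WEB.get)
print(sorted(pages))
```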

2. Index

The spider results are then gathered together into a huge index, which contains a copy of every web page visited by the spider. Words like "and", "to", "a", "or", etc. are removed. If a web page changes then this copy is updated. Sometimes it can take a while for new pages or changes that the spider finds to be added to the index. A web page may have been "spidered" but not yet "indexed." Until it is indexed it is not available to people searching with the search engine. Some engines "expire" pages if they have not been updated within a set time period.
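
One common way to organise such an index is as an "inverted index" that maps each word to the pages containing it. The sketch below uses that structure as an assumption; it is not taken from any particular engine, and the stop-word list and page texts are made up for illustration.

```python
import re

# A tiny illustrative stop-word list; real engines use longer ones.
STOP_WORDS = {"and", "to", "a", "or", "the", "of", "in"}


def build_index(pages):
    """Return {word: set of urls} for every non-stop word in the pages."""
    index = {}
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                index.setdefault(word, set()).add(url)
    return index


# Hypothetical pages as returned by a spider: {url: page text}.
pages = {
    "http://example.com/a": "Bugs in Windows 2000",
    "http://example.com/b": "A guide to cleaning windows",
}
index = build_index(pages)
print(sorted(index["windows"]))
```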

3. Search software

The third part is the search software which sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant. There are usually various filters and options that are applied to the search. Different search websites may be using the same index, for example the Inktomi index "powers" the HotBot, MSN and GoTo engines, though results may be slightly different in each since each uses the Inktomi index in a slightly different way.
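
The matching and ranking step can be illustrated with a very simple scoring rule: count how many of the query words each page contains. This rule, the function name search and the sample index are assumptions for the sake of the example; real engines combine many more signals.

```python
def search(index, query):
    """Rank pages by how many of the query words they contain."""
    scores = {}
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] = scores.get(url, 0) + 1
    # Highest score first; ties broken alphabetically so the order is repeatable.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical index of the form {word: set of page URLs}.
index = {
    "windows": {"http://example.com/a", "http://example.com/b"},
    "bugs": {"http://example.com/a"},
}
print(search(index, "windows bugs"))
```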

Manual registration
It is also possible to register your web pages with search engines. This is particularly useful if your web site is new and is not linked to by other web pages, or if you want to speed up the registration process.

Servers
Because of the huge indexes and billions of search requests each day, search engine hardware has to be incredibly powerful. Search engines are often a showcase for manufacturers. Part of the server room for the Inktomi index is shown.

Search Engine Optimization 101
SEO Logic (2007) Search Engine Optimization 101. Retrieved September 3, 2007 from http://www.seologic.com/guide/

Refining searches
Searching will often return hundreds of thousands of mainly irrelevant results. To narrow down a search:
 * Use + and AND. Without these the search engine will find pages containing any one of the words. For example, say you want to find information on Windows 2000 bugs. If you search for windows 2000 bugs you will also find documents that are just about windows, or bugs, or that contain the number 2000. A search of +windows +2000 +bugs will find pages that contain all three words. However, this may still find pages where the words are completely unrelated to each other: you might find a window cleaner's notes on bugs written in 2000.
 * Define phrases using quotes. For example, search for "windows 2000" +bugs or, even better, "windows 2000 bugs", which should only find pages that contain that entire phrase.
 * Most search engines also recognise proper names, so you can capitalise these. For example, use Windows to find information on the Microsoft operating system and windows to find information on the things you look through.
 * Subtract subjects you are not interested in. For example, if "Windows 2000 bugs" finds lots of resources about Windows 98 or Windows 95, then try a search such as "Windows 2000" +bugs -"Windows 98" -"Windows 95".
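
The rules above can be sketched as a small matcher. This is an illustration only: the function names parse_query and matches are invented, quoted phrases are assumed to be required, matching is a plain case-insensitive substring test, and the capitalisation rule for proper names is not modelled.

```python
import re


def parse_query(query):
    """Split a query into required terms, excluded terms and optional terms."""
    required, excluded, optional = [], [], []
    # Each token is an optional + or - sign followed by a quoted phrase or a bare word.
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]*)"|(\S+))', query):
        term = (phrase or word).lower()
        if not term:
            continue
        if sign == "-":
            excluded.append(term)
        elif sign == "+" or phrase:
            required.append(term)
        else:
            optional.append(term)
    return required, excluded, optional


def matches(text, query):
    """Return True if text satisfies the +, - and phrase rules of query."""
    required, excluded, optional = parse_query(query)
    text = text.lower()
    if any(term in text for term in excluded):
        return False
    if required:
        return all(term in text for term in required)
    # With no required terms, any one of the plain words is enough.
    return any(term in text for term in optional)


query = '"Windows 2000" +bugs -"Windows 98" -"Windows 95"'
print(matches("Known bugs in Windows 2000", query))
print(matches("Bugs in Windows 2000 and Windows 98", query))
```

With these rules a page such as "Cleaning windows: notes on bugs, written in 2000" satisfies +windows +2000 +bugs but not "windows 2000" +bugs, which is the difference the phrase search is meant to make.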

Advanced Searches
Many search engines also allow you to filter your results further. Often you can:
 * Search by date, e.g. only those pages that are less than a year old
 * Search only web page titles
 * Find pages written in a specific language
 * Find only pages containing specified media types, e.g. images, audio
 * Find only files that end in a certain file extension, e.g. .gif .mp3
 * Find pages that are hosted in a particular domain or geographical area
 * Search by "depth" of page on a website, e.g. only home pages or only sub-pages

Boolean searches
These are searches that include or exclude documents containing certain words through the use of operators such as AND, NOT and OR. Boolean search commands have been used by professionals for years to search traditional databases, library catalogues, etc. However, they are overkill for the average web search, as all the search engines support + and -. The OR command is used for specifying alternatives, e.g. New Zealand OR Aotearoa. Another Boolean command is NEAR, which finds words that are in close proximity to each other. The actual distance between the words depends on the individual search engine. It is possible to nest Boolean searches using brackets, e.g. impeachment AND (clinton OR johnson).
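
To show how a nested Boolean query might be evaluated, here is a small sketch of a recursive parser. It makes several assumptions that are not from any real engine: operators must be written in upper case, NOT is a prefix (as in impeachment AND NOT clinton), every search term is a single word, and NEAR is not supported. The function name boolean_match is hypothetical.

```python
import re


def boolean_match(query, text):
    """Evaluate a query with AND, OR, NOT and brackets against the words of text."""
    words = set(re.findall(r"\w+", text.lower()))
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "OR":
            take()
            right = parse_and()  # always parse the right side so its tokens are consumed
            result = result or right
        return result

    def parse_and():
        result = parse_not()
        while peek() == "AND":
            take()
            right = parse_not()
            result = result and right
        return result

    def parse_not():
        if peek() == "NOT":
            take()
            return not parse_not()
        if peek() == "(":
            take()
            result = parse_or()
            if take() != ")":
                raise ValueError("missing closing bracket")
            return result
        token = take()
        if token is None or token in {")", "AND", "OR"}:
            raise ValueError("expected a search word")
        return token.lower() in words

    result = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: " + peek())
    return result


print(boolean_match("impeachment AND (clinton OR johnson)", "The impeachment of Andrew Johnson"))
```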

Natural language queries
These allow you to ask a question using English, e.g. "What is the weather like in Auckland?" They are best when you are looking for the answer to something, rather than a specific web site, for example "What is the population of Ceylon?" The best service is Ask Jeeves: http://www.askjeeves.com/, which is used by the AltaVista search engine.

Google
Google is probably the most popular search engine that strives both to index many pages and to improve usability. It began as an academic search engine. In a paper describing how it was built, Sergey Brin and Lawrence Page indicated that with four spiders the system could crawl 100 pages and generate about 600 KB of data per second. The Google spider looked at two things:

 * Words within a page
 * Where the words were found (meta tags, titles, headings)

Google in the 1970's 

Google in 2012

 * Google Throws Open Doors to Its Top-Secret Data Center http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/
 * Explore a google data center with streetview: (YouTube video) http://www.youtube.com/watch?v=avP5d16wEp0

Google search tips
For definitions try define:

Search Hints
If you are searching the internet for something specific, you may be able to improve the quality of the results by following the steps below.

1. Enter a URL
 * Look for a US site, e.g. gardening: http://www.gardening.com
 * Add the country code, e.g. gardening: http://www.gardening.co.nz
 * Try similar words, e.g. flowers: http://www.flowers.com
 * Try an abbreviation, e.g. Dick Smith Electronics: http://www.dse.co.nz

2. Use a search tool
 * Google ( http://www.google.com )

3. Narrow the search with a Boolean operator
 * gardening AND roses

If you want to find a definition of a word
 * In Google type define:word

Search terms

 * Concept search: A search for documents related conceptually to a word, rather than specifically containing the word itself.
 * Full text index: An index containing every word of every document cataloged, including stop words (defined below)
 * Fuzzy search: A search that will find matches even when the words are only partially spelled or misspelled.
 * Keyword search: A search for documents containing one or more words specified by a user.
 * Phrase search: A search for documents containing an exact sentence or phrase specified by a user.
 * Precision: The degree to which a search engine lists documents matching a query. The more matching documents listed, the higher the precision. For example, if a search engine lists 80 documents found to match a query but only 20 of them contain the search words, then the precision would be 25%.
 * Proximity search: A search where the users specify that documents returned should have the words near each other.
 * Query by example: A search where a user instructs an engine to find more documents that are similar to a particular document. Also called find similar.
 * Stemming: The ability for a search to include the stem of words. For example stemming allows a user to enter swimming and get back results for the stem word swim.
 * Stop words: Conjunctions, prepositions, articles and other words, such as AND, TO and A, that appear in documents yet on their own carry little meaning.
 * Thesaurus: A list of synonyms a search engine can use to find matches for particular words if the words themselves do not appear in documents.
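
The precision figure in the definition above can be checked with a short calculation. The function name precision and the document names are made up for illustration; the sketch simply divides the number of relevant documents listed by the total number listed.

```python
def precision(listed, relevant):
    """Fraction of the listed documents that are actually relevant."""
    if not listed:
        return 0.0
    hits = sum(1 for doc in listed if doc in relevant)
    return hits / len(listed)


# The example from the definition: 80 documents listed, 20 of them relevant.
listed = [f"doc{i}" for i in range(80)]
relevant = set(listed[:20])
print(f"{precision(listed, relevant):.0%}")  # prints 25%
```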

Web spam: Modifying a web page to influence its ranking
Also known as:
 * web spam (Gyöngyi and Garcia-Molina 2005; Joachims, Granka et al. 2005; Metaxas and DeStefano 2005),
 * search engine spam (Wu and Davison 2005) and
 * spamdexing (Gyöngyi and Garcia-Molina 2005).

Gyöngyi, Z., H. Garcia-Molina, et al. (2004). Combating Web Spam With TrustRank. Proceedings of the 30th International Conference on Very Large Data Bases (VLDB ’04).

From Jones, T. (2005). Both Sides of the Digital Battle for a High Rank from a Search Engine. Association for Computing Machinery New Zealand Bulletin, 1 (2) (ISSN 1176-9998).

Image searching
"Search Looks at the Big Picture", Wired News (01/06/05); Gartner, John.

Computer scientists are working on visualization technologies that would allow image searches to be done without relying solely on text descriptions, such as is currently done with Google, Yahoo!, and other search engines. Image search engines that can identify components of an image will be less vulnerable to manipulation by pornographers or deceitful advertisers that tag pictures as "Britney Spears" or some other popular content.

Xerox Research Center Europe and a group of European universities are jointly working on software that recognizes "key patches" in pictures, such as an ocean and beach, or car and tires; since development began in 2002, the software has learned hundreds of objects, and would be useful in distinguishing between dissimilar pictures that share a common keyword. Improved image search would also open up new advertising opportunities, such as a function that allows online shoppers to search for a similar-looking red sweater, but at a lower price than one already found. Such capability would open up a vast new market as companies work to improve their image search rankings for popular products, people, or places.

IBM Pervasive Media Management group is also developing visualization technology, but for video; the Marvel system also relies on learned images and categorizes them into concepts, such as travel or sports. IBM is currently working with CNN and ABC to classify news coverage concepts so Marvel technology can be applied to news archives. Search companies have not yet endorsed visualization technology, but Yahoo! is trying to bolster its video search with its Media RSS format and by working with Hollywood studios on metadata tagging. (Gartner, 2005, Jan 6)

Google Goggles

 * Google Goggles - December 12, 2009. Image searching for Android (Google operating system) phones.

Related applications

 * Translation facilities (Google, 2009a)
 * Google has found that certain search terms are good indicators of flu activity. Google Flu Trends uses aggregated Google search data to estimate flu activity in US states up to two weeks faster than traditional systems. (Google, 2009b)

Google zeitgeist (http://www.google.com/intl/en/press/zeitgeist/index.html)
Google aggregates millions of search queries and provides several tools that give insight into global, regional, past and present search trends.