
Latent Semantic Indexing

Latent semantic indexing is a technique for analyzing and indexing documents.
It means increased relevance and speed when retrieving data from a data bank.
This page explains basic latent semantic indexing and how it relates to the Semantic Web
as well as how webmasters and website owners can use it.
It is part of Don Pedro's Website Design Handbook.

At the bottom of the page, there is a link to a print-ready version.

What is Latent Semantic Indexing ?
What is the Semantic Web ?
What is "On-topic" Analysis ?
Yahoo's Patent Application
How can we Utilize Semantic Indexing ?
Latent "Semantics" Summary

Last updated: Aug. 31, 2010

What is Latent Semantic Indexing ?

Latent semantic indexing is a technique used by many search engines to group related documents and words (and websites) into clusters that belong to the same themes (topics). For this, search engines such as Google and MS Live use latent semantic analysis, which comes from natural language processing.

It's a mathematical way to describe documents by listing the words and/or phrases used in them in a matrix. The value in each matrix cell is proportional to the number of times a term appears in a document. Latent semantic analysis transforms these relationships between words and documents into relationships between words and concepts, and between concepts and documents. With these relationships you can do the following:
  • compare documents,
  • find similar documents in different languages,
  • find synonyms [ in this case related words and/or phrases ],
  • from a certain search term ("key word or key phrase") find relevant documents
The synonyms you or the search engine "find" this way are not the same as those a dictionary would give. They are related words and/or phrases often used together within documents belonging to the same topic - "semantic synonyms" in contrast to "dictionary synonyms".
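
The matrix described above can be sketched in a few lines of Python. This is only a minimal illustration of the counting step, not what any search engine actually runs: the sample documents, the function names and the simple tokenizer are all assumptions made for the example. Full latent semantic analysis would go one step further and factor this matrix (with a singular value decomposition) to obtain the "concepts"; here the raw counts are compared directly.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words, dropping surrounding punctuation."""
    words = (w.strip('.,;:!?()"\'') for w in text.lower().split())
    return [w for w in words if w]

def term_document_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

def cosine(a, b):
    """Cosine similarity of two equally long count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_documents(query, docs):
    """Return (score, document index) pairs, most similar document first."""
    vocab, matrix = term_document_matrix(docs)
    query_counts = Counter(tokenize(query))
    query_vector = [query_counts[term] for term in vocab]
    scores = []
    for j in range(len(docs)):
        doc_vector = [row[j] for row in matrix]  # column j of the matrix
        scores.append((cosine(query_vector, doc_vector), j))
    return sorted(scores, reverse=True)

# Hypothetical sample documents, invented for this sketch
docs = [
    "Find a job: search job vacancies and career advice.",
    "Boat engines and marine maintenance tips.",
    "Career advice for job seekers and employers.",
]
for score, index in rank_documents("job career", docs):
    print(round(score, 2), docs[index])
```

The same cosine function can compare two documents with each other by passing two columns of the matrix instead of a query vector.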

Latent semantic analysis isn't used only to find related words but also to find related phrases. Because a normal, spam-free document generally contains only a limited number of related phrases, while a spam document contains an excessive number (possibly well over a hundred), this technique can also be used to detect spam. It is rumoured that Google and Yahoo use it this way too.
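
As a rough illustration of the spam idea, the sketch below counts how many phrases from a list of related phrases turn up in a text and flags the text when the count passes a threshold. The phrase list, the function names and the default threshold of 100 (taken from the "over a hundred" remark above) are assumptions for the example; how the search engines really do this is not public.

```python
def count_related_phrases(text, related_phrases):
    """Count how many of the given phrases occur at least once in the text."""
    normalized = " ".join(text.lower().split())  # lowercase, collapse whitespace
    # Plain substring matching: simple, but "job" would also match inside "jobs"
    return sum(1 for phrase in related_phrases if phrase.lower() in normalized)

def looks_like_spam(text, related_phrases, threshold=100):
    """Flag a text that contains more related phrases than the (assumed) threshold."""
    return count_related_phrases(text, related_phrases) > threshold

# Hypothetical phrase list for the topic "employment"
related = ["job search", "job vacancies", "career advice"]
page = "Job search tips and career advice."
print(count_related_phrases(page, related))
print(looks_like_spam(page, related))
```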

As semantic analysis spreads among search engines, automatic programs start "judging" the quality of webpages and websites. Because of this, small details grow in importance. A computer program doesn't distinguish between important and not so important errors. An error is an error.

What is the Semantic Web ?

The "father" of the World Wide Web and the Semantic Web, Tim Berners-Lee, states that the Semantic Web is:
QUOTE: "... about creating things from data you have compiled yourself, or combining it with volumes (data bases) of data from other sources to make new discoveries." END QUOTE
The goal is to share and process data automatically, i.e. by computers, instead of manually combining documents found by the computer. The Semantic Web is about data, not about documents.

It is not about marking up existing HTML documents, and it is not about applying artificial "intelligence". The Semantic Web is, according to Tim Berners-Lee, about data currently held in relational databases (like the search engines' data banks), XML documents, spreadsheets, and data files in other formats. "It is not about people encoding webpages". It is about applications (programs) generating machine-readable data on an entirely different, automated scale.

The Semantic Web therefore doesn't require content and webpage owners to individually encode their information. The great bulk of data suitable for the Semantic Web is already sitting in data bases.

What is "On-topic" Analysis ?

"On-topic" analysis is what the search engines do when comparing key phrases in Internet documents ( i.e. webpages ). By analyzing hundreds of documents it's possible to find sentences that tend to occur together in "good" documents ("co-occurrence").

Once the search engines have made their own directories of co-occurring sentences for specific topics, these same sentences are used to determine whether a new document belongs to a certain topic ( theme ) or not. Usually one or a few sentences from a group can be used to forecast the presence of other sentences or concepts from that same topic. If those other concepts are not found in a document, it may be deemed not to be a "good document".
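
The two steps just described (build a directory of co-occurring phrases or sentences from "good" documents, then check a new document against it) can be sketched as follows. Everything here is illustrative: the phrase list, the sample documents, the function names and the `min_count` cut-off are assumptions, not a description of any search engine's real algorithm.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(good_docs, phrases):
    """Count in how many documents each pair of (lowercase) phrases appears together."""
    pair_counts = defaultdict(int)
    for doc in good_docs:
        text = doc.lower()
        present = sorted(p for p in phrases if p in text)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def expected_phrases(seed, pair_counts, min_count=2):
    """Phrases that appeared together with `seed` in at least `min_count` documents."""
    expected = set()
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            if a == seed:
                expected.add(b)
            elif b == seed:
                expected.add(a)
    return expected

def on_topic_score(doc, seed, pair_counts, min_count=2):
    """Fraction of the expected phrases found in doc (0.0 if nothing is expected)."""
    expected = expected_phrases(seed, pair_counts, min_count)
    if not expected:
        return 0.0
    text = doc.lower()
    return sum(1 for p in expected if p in text) / len(expected)

# Hypothetical topic phrases and "good" documents
phrases = ["job search", "career advice", "vacancies"]
good_docs = [
    "Job search help, career advice and vacancies.",
    "Career advice for your job search.",
    "Latest vacancies and job search tools.",
]
directory = build_cooccurrence(good_docs, phrases)
new_doc = "Job search and vacancies in the marine industry."
print(on_topic_score(new_doc, "job search", directory))
```

A low score for a page that uses the seed phrase suggests the page is missing concepts that normally accompany it.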

The result of this kind of document analysis is that you have to stay strictly within one subject on each webpage so as not to "blur" the theme. There is a discussion on Webmaster World about whether Google is extending the "on-topic analysis" to cover complete websites ( or domains ). If that's the case, then reducing the number of "ill-fitting" or non-relevant ( off-topic ) words and phrases should give better ranking for all pages within that website.

In August/September 2007 the TouchGraph Google Browser was published, which gives a "picture" of clusters of conceptually related websites. A conceptual analysis like this is based on latent semantic analysis. It can help you to check the linkage and the neighbourhood of your own or somebody else's website.

To help you determine whether your webpage is "on-topic" or not, you can use keyword analysis. It's quite easy to see whether all your top keywords and top key phrases are within the topic or not. If they are not, you may have to split the webpage into two, each with one specific topic or sub-topic.
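
A simple keyword analysis of this kind can be done with a short script. The sketch below lists the most frequent words and two-word phrases in a page's text; the stop-word list, the function name and the sample text are assumptions chosen for the example, and a real tool would use a much longer stop-word list.

```python
import re
from collections import Counter

# Small, assumed stop-word list
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on", "with", "your"}

def top_terms(text, n=5, phrase_len=1):
    """Return the n most frequent words (phrase_len=1) or word sequences of phrase_len words."""
    words = re.findall(r"[a-z']+", text.lower())
    if phrase_len == 1:
        terms = [w for w in words if w not in STOP_WORDS]
    else:
        # Slide a window of phrase_len words over the text
        windows = zip(*(words[i:] for i in range(phrase_len)))
        # Skip phrases that start or end with a stop word
        terms = [" ".join(g) for g in windows
                 if g[0] not in STOP_WORDS and g[-1] not in STOP_WORDS]
    return Counter(terms).most_common(n)

# Hypothetical page text
page = ("Job search tips. Start your job search with a career plan. "
        "A career plan helps every job search.")
print(top_terms(page, n=3))
print(top_terms(page, n=3, phrase_len=2))
```

If the top words and phrases fall into two clearly different groups, that is a hint the page covers two topics.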

Yahoo's Patent Application

Yahoo's patent application [Dec. 2006] was published in February 2008: "System and Method for Determining Concepts in a Content Item Using Context". It's all about the automated use of phrases to rank search results by their relevance to search queries. I have already stated my opinion on another page ( How to Find the Best Keywords ? ): "It's better to use key phrases instead of keywords".

According to the patent application, the intention is to take into account how a phrase or concept is related to other phrases or concepts in the same document ( webpage ). After that, certain phrases or concepts are associated with a certain webpage for indexing purposes. This is another explanation of semantic indexing. The key phrases are then compared with a database list of user queries. The phrases are also identified by the way they are related to certain topics ( co-occurrence ).

Based on the above, the ranking algorithm calculates a numeric value for the relevance of a page in relation to a certain concept. The frequency of a phrase's occurrence can even be compared with its average occurrence in other web documents and in the query logs.
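
To make the idea of comparing a phrase's frequency with its average occurrence elsewhere more concrete, here is a small sketch. It is not the formula from the patent application, which is not given here; the per-100-words measure, the ratio and all names and sample texts are assumptions chosen only to illustrate the comparison.

```python
def phrase_frequency(text, phrase):
    """Occurrences of the phrase per 100 words of text (0.0 for an empty text)."""
    lowered = text.lower()
    words = lowered.split()
    if not words:
        return 0.0
    return 100.0 * lowered.count(phrase.lower()) / len(words)

def relative_relevance(page, phrase, other_docs):
    """Ratio of the page's phrase frequency to the average frequency in other_docs."""
    page_freq = phrase_frequency(page, phrase)
    if not other_docs:
        return 0.0
    average = sum(phrase_frequency(d, phrase) for d in other_docs) / len(other_docs)
    if average == 0.0:
        # Phrase never seen elsewhere: no meaningful ratio, report 0.0 in this sketch
        return 0.0
    return page_freq / average

# Hypothetical page and reference documents
page = "job search tips and job search help today"
others = ["job search advice here now", "boats and engines for sale today"]
print(relative_relevance(page, "job search", others))
```

A value above 1 means the phrase is used more heavily on the page than in the reference documents; a very high value could just as well indicate over-use.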

So the picture is getting clear. Once Yahoo starts applying this, we need to use related phrases instead of keywords when optimizing our webpages. There is really no difference between Google's and Yahoo's approaches.

How can we Utilize Semantic Indexing ?

We (webmasters, designers, website owners) don't need to use semantic indexing ourselves; the search engines will do it. What we can do is utilize this knowledge when building our websites and webpages.

As noted above, semantic indexing means the search engines start recognizing related words and phrases, depending on a webpage's (or website's) theme (topic). They not only recognize but even expect certain words and certain expressions (key phrases). So we would be better off using "semantic synonyms" instead of "dictionary synonyms".

Ah, but how? As said, we need to use "related concepts" and then find "semantic synonyms" for these. Now we are coming to what is called siloing. We start building mini-networks among the pages on our websites. As anchor texts (link texts) we use these "semantic synonyms", and then we optimize the webpages in each mini-network for those related words and phrases, i.e. "semantic synonyms".
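
One way to picture such a mini-network is as a small table of pages and the anchor text each page should be linked with. The sketch below builds, for every page in one silo, the list of links to the other pages of the same silo. The page addresses, anchor texts and function name are invented for the example.

```python
def silo_links(silo):
    """For each page in a silo, list (anchor text, target URL) links to the other pages."""
    links = {}
    for url in silo:
        links[url] = [(anchor, target)
                      for target, anchor in silo.items()
                      if target != url]
    return links

# Hypothetical silo for the topic "employment": page URL -> anchor text to use for it
employment_silo = {
    "/jobs/search.html": "job search",
    "/jobs/vacancies.html": "vacancies",
    "/jobs/career.html": "career",
}
for page, targets in silo_links(employment_silo).items():
    for anchor, target in targets:
        print(page, "->", target, "with link text:", anchor)
```

Each page would then be optimized for the phrase used as its anchor text, so that the link text and the page content agree.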

And how do we find these words ? For instance, go to Google and, as the search term, type "~" directly before the word or concept for which you want Google's "synonyms" [without the quotation marks]. In the results some words appear in bold - these are the "semantic synonyms" you are looking for.

Example: For the search term "~employment" Google gives (over the first 10 result pages): job, work, career, job opportunities, recruitment, employer, employment career, job search, search for work, vacancies, employees. (All of these appear in bold.)

Since March 2007 you can also go to Google Labs to get "related topics". Usually the "small set" is enough. Give at least three examples to keep the suggestions within your topic. You could use these as subtopics, i.e. webpage topics, each related to your website's main topic.

Please note: these words and phrases are all found on webpages. Among them are a few I wouldn't expect to be "synonyms", such as employer, employment career, job search, and search for work. They are not necessarily words used by searchers, though they probably are. They are semantic synonyms, regarded by Google - and possibly other search engines - as relevant related words and phrases to be expected in a serious document about finding a new job.

Using words that aren't necessarily highly searched for, but are expected by the search engines, makes your webpages stronger in the ever-increasing competition. When the text of a webpage includes the words the search engines would expect in a good document within your topic (theme), you in fact make the webpage more attractive to the search engines.

When you use the plural form of a noun, the singular is automatically included, so it is not necessary to include both plural and singular in the "keywords" meta tag ("key phrase" meta tag).

Latent "Semantics" Summary

Google (and Yahoo & MS Live) apply their own latent semantic analysis based search paradigm for indexing good websites. In particular, they use their own lists of "semantic synonyms", which are related words and phrases that depend on the topic and are expected in good documents within that topic (theme). These are the words and phrases you get when searching with "~".

Regular webmasters and website designers don't really have to worry about the Semantic Web. It is about data, while today's Web is about documents. Of course, to be successful with one's website it is advisable to use this knowledge. As with everything else concerning search engine marketing, this is slow work. Even "siloing" isn't a "magic wand". There is no instant shortcut to long-term success.

When looking at these "semantic synonyms" one should keep in mind that the actual words and phrases used, and therefore relevant, differ between world regions. For instance, the same idea or concept can be conveyed with different words or phrases in British, American, and Australian English. To find the right ones to use, you have to use a local search engine for the respective area.

So what about Semantic Search ? The "Hakia" search engine people try to answer that in "10 Things that Make Search a Semantic Search".

















Get print-ready version (3 pages in small font, 4 pages in normal font)

© by Cristina and Peter Forsberg.
You are allowed to print out the text for your personal needs.
You are also allowed to copy and distribute the printout for educational purposes when free of charge,
as long as you give the source: www.donpedrowebdesign.com/latent-semantic-analysis.html.

Related pages:
| Search Engine Marketing | Meta Tags and Search Engines |
| How to Find the Best Keywords | XML Sitemaps |
