Websearch Algorithims

What is Page Rank?

Page rank is a way to rank webpages. When something is searched, the order in which the sites are ordered might be determined by the page rank algorithm. It is the algorithm for website ranking currently used by Google. Website connections are described using a graph. The page rank works in iterations.

Page Rank Algorithm:

  1. First, for every node in the graph, put its initial page rank value as 1/n, where n is the number of nodes in the graph.

  2. Then, in the next iteration, go through each node (call this node X). For every node Y that points towards node X, use the equation Page Rank = sum( page rank of y / connections of outgoing node y edge).

  3. Then just keep repeating this process. In theory, this process can go on forever, so just stop when the page rank values are barely changing.

Cheating the Page Rank algorithm:

Page rank, although not perfect, prevents a lot of websites from being “boosted”. It is worth so much more when a creditable website (with a high page rank) connects to a different website than a bunch of not-creditable websites. So, the use of spam websites to boost one central page’s score isn’t very effective.

What is the trust rank?

Trust rank is an algorithm used to determine how much a website can be trusted. It can be used to detect spam websites and remove them. In trust rank, a set of trusted websites is made and then used to determine the trust of all of the other websites.

Trust rank revolves around “Approximate isolation,” which means it’s not likely for trusted websites to have connections to poor websites.

Trust Rank Algorithim:

Unfortunately, determining a set of “good” websites is a potentially expensive process. In some cases, a human is used to determine whether a website is trusted or not. Possibly websites with trusted extensions like.edu might be used. In other cases, when a topic is searched, the top pages according to page rank are used. All the websites checked will be put into a seed set. The trusted website contained in the set will be called a teleport set.

Every website in the teleport will promote trust in the other website. Each page will have a trust value between 0 and 1.

Trust Attenuation: The degree of trust given by a trusted site decays as the website is further from the site.

Trust Splitting occurs when a website has a lot of links going to different websites. The amount of trust given to the website is split. More outlinks = less trust for each link.

If the trust value of a website is below some threshold number, remove it.

Spam websites:

Spam mass is the portion of the total pages connecting to this page that are spam pages. The equation for calculating spam mass goes as follows:

Page Rank from spam websites = (Page Rank) – (Page rank assigned by trusted websites)

Spam mass = (Page rank from spam websites) / (Page rank)

What is the HITS (Hypertext-induced Topic selection) algorithm?

The HITS algorithm is able to know hubs and authorities. The HITS algorithm is also used to determine how good a website is, just like page rank. In hubs and authorities, there are searches for trusted website linkers (call these experts). It uses links as a way to quantify how important a website is.

HITS Algorithim:

Instead of relying on just one value, like page rank, HITS has two values. It has an authority value and a hub value. The quality of an expert is determined by the value given by the authorities (hub). The quality of a content distributor is determined by the value given by the authorities.

Authorities are usually websites such as newspaper home pages, and hubs are websites such as a list of newspapers. Authorities are websites used to point towards hubs.

This situation can be visualized as a bipartite graph. Where hubs are one disjoint set and authorities are another.

The steps for conducting the HITS algorithm go as follows:

  1. Start by calculating the hub scores based on how many authority sites are pointing towards it (4 in degree = 4 hub scores).

  2. Now do it in reverse: use the hub scores to calculate the authority scores. The authority score is the sum of all the hub scores of the hubs that it points to.

  3. Continue this process until the change in hub and authority becomes obsessive.