INFORMATION SYSTEM
OLD QUESTION BANK
IS CASE STUDY TOPICS
IS PRACTICE QUESTION

Link analysis serves as an analytical technique employed to identify, assess, and comprehend connections within data. This method is utilized for three primary purposes:

a. Identifying matches in data based on known patterns of interest.

b. Detecting anomalies when known patterns are breached.

c. Uncovering new patterns of interest, such as in social network analysis and data mining.

The internet, beyond being a mere collection of documents, comprises diverse hyperlinks. When a link exists from page A to page B, it may signify that A is related to B or that A is recommending, citing, voting for, or endorsing B. Links can be categorized as either referential—providing directions like "click here to return home"—or informational—providing additional details with a directive like "click here for more information."

The two most common link analysis algorithms are:

a. PageRank

b. HITS

 

  1. PageRank

PageRank stands as an algorithm employed by Google Search to assess and rank websites within their search engine results. This algorithm is named after Larry Page, one of the founders of Google, and is instrumental in measuring the importance of website pages.

According to Google's explanation, PageRank operates by evaluating both the quantity and quality of links pointing to a page, thereby providing an approximate gauge of the website's significance. The underlying assumption is that websites deemed more important are likely to garner more links from other websites.

The origins of PageRank can be traced back to Sir Tim Berners-Lee, who initially recorded new websites in a notebook. As the number of websites increased, those containing links to new sites were identified. With the growth of the web, search capabilities were incorporated, eventually leading to the creation of search engines. Hence, the inception of search engines evolved from the practice of storing new websites and recognizing links within the burgeoning web landscape.

Algorithm 

The PageRank algorithm generates a probability distribution that signifies the probability of a person randomly clicking on links and arriving at any given page. This algorithm is applicable for calculating PageRank on document collections of varying sizes. In several research papers, there is an assumption that, at the outset of the computational process, the distribution is uniformly divided among all documents in the collection.

The computation of PageRank necessitates multiple passes, referred to as "iterations," through the document collection. These iterations are crucial for adjusting the initial approximate PageRank values, aiming to bring them into closer alignment with the theoretically accurate values.

Page Rank: Formula

PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Ti)/C(Ti))

Where,

PR(A) is the PageRank of page A

PR(Ti) is the PageRank of pages Ti which link to page A

C(Ti) is the number of outbound links on page Ti

d is a damping factor which can be set between 0 and 1.

 

  1. HITS


The Hyperlink-Induced Topic Search (HITS) algorithm, developed by Jon Kleinberg, is a link analysis algorithm utilized to assess and rank web pages based on their relevance to a specific search. This algorithm leverages web link structures to discover and prioritize web pages in response to a given query.

The HITS algorithm involves the assignment of two scores to each page:

i. Authority Score:

The authority score gauges the value of the content on a page.

ii. Hub Score:

The hub score evaluates the value of a page's links to other pages.

The algorithm follows these steps:

  • Initiate with the user-supplied query, forming an initial set S of pages, known as the root set. This set is expanded to a larger root set T by including pages linked to or from any page in the initial set S.
  • Associate each page p with a hub weight h(p) and an authority weight a(p), both initialized to 1.

Iteratively update the hub and authority weights of each page. Notation p → q signifies "page p has a hyperlink to page q." The updates for hubs and authority are as follows:
a(p) = ∑p→q h(q)

  • h(p) = ∑q→p a(q)

These iterations refine the hub and authority weights, providing a more accurate representation of the relevance of webpages to the given query.