Motivated by this discussion, here is a pile of papers on page-ranking and web graph structure. ... Drop me a line if you know good papers on this subject ...
Google (1998): The PageRank Citation Ranking: Bringing Order to the Web
This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages.
What can you do with a Web in your Pocket? (1998)
It is currently possible for a university research project to store and process the entire World Wide Web. Since there is a limit on how much text humans can generate, it is plausible that within a few decades one will be able to store and process all the human-generated text on the Web in a shirt pocket.
Context in Web Search, Steve Lawrence (2000)
The Web as a graph: measurements, models, and methods (1999)
We then report a number of measurements and properties of this graph that manifested themselves as we ran these algorithms on the Web.
The Shape of the Web and Its Implications for Searching the Web (2000)
Recent work in the field has focused on detecting identifiable patterns in the web graph and exploiting this information to improve the performance of search.
Enough for now.
Interesting discussion
Date: 2001-06-10 08:45 am (UTC)
1. I think both sides may benefit from reading the original Google page-ranking paper; some other related papers I've put here.
One convenient model is a Markov chain: webpages are nodes, hyperlinks are transitions. The corresponding transition matrix M is the matrix that you discuss. If the Markov chain is irreducible (there is a path from each vertex to each vertex) and aperiodic, then it has a limit probability distribution L (the Perron-Frobenius theorem). One can pick an arbitrary probability vector V, for example (1/N, 1/N, ..., 1/N), as the initial rank distribution; after sufficiently many iterations, M^k * V will be close to L. This L is the "fair" rank distribution that we are looking for.
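A minimal sketch of this power iteration (Python with NumPy), on a hypothetical three-page web where every page links to the other two, so the chain is irreducible and aperiodic:

    import numpy as np

    # M[i][j] = probability of moving from page j to page i;
    # here each page links to the other two with equal probability.
    M = np.array([[0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5],
                  [0.5, 0.5, 0.0]])

    N = M.shape[0]
    V = np.full(N, 1.0 / N)   # initial rank distribution (1/N, ..., 1/N)
    for _ in range(100):      # M^k * V converges to the limit L
        V = M @ V

    print(V)                  # the "fair" rank distribution L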
However, the web graph does not yield an irreducible Markov chain, since there are sites that have no outgoing links. A random walk on the web will eventually get stuck in one of them. Solution: "let's jump to a random site if we are stuck" (which corresponds to connecting each terminal node to all the other nodes in the graph). This explains the factor d in the formula.
However, the graph is still not strongly connected, since there can be "sinks". In the simplest case, consider two nodes that point to each other, and suppose there is a link to one of them from outside. At each iteration these two nodes will accumulate the probability mass.
The Google paper suggests adding a vector E that balances the deficiency of probability mass due to sinks. The choice of E lets you play with the ranks as you like; see the paper.
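Putting the pieces together, here is a sketch of the whole scheme in the common matrix form R = d*M*R + (1-d)*E; the graph below is hypothetical, chosen so that it contains both the two-node sink described above and a dangling page:

    import numpy as np

    def pagerank(M, d=0.85, E=None, tol=1e-10):
        # M[i][j] = 1/outdeg(j) if page j links to page i, else 0;
        # columns of dangling pages (no outgoing links) are all zeros.
        n = M.shape[0]
        if E is None:
            E = np.full(n, 1.0 / n)      # uniform E; bias it to taste
        dangling = M.sum(axis=0) == 0    # pages with no outgoing links
        R = np.full(n, 1.0 / n)
        while True:
            # mass sitting on dangling pages jumps to a random page via E
            leaked = R[dangling].sum()
            R_next = d * (M @ R + leaked * E) + (1 - d) * E
            if np.abs(R_next - R).sum() < tol:
                return R_next
            R = R_next

    # Pages 0 and 1 form the sink pair, page 2 links into the sink and
    # to page 3, and page 3 has no outgoing links (dangling).
    M = np.array([[0.0, 1.0, 0.5, 0.0],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.0]])
    print(pagerank(M))  # the sink pair ranks highest but no longer absorbs everything

Note that E need not be uniform: biasing it is exactly the "play with the ranks" knob mentioned above.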
2. Google started as an academic project (Stanford); papers and even source code were on the net. Now that it is a commercial enterprise, its know-how is not published. It is obvious that ranking cannot be resolved by a single formula: if it were so simple, all search engines on the net would be as good as Google, but they are not (my opinion).
3. Google is great at ranking [and IMHO, he who rules the ranking rules the net], and it has the great feature of "cached pages". Also, it can show you all citations made to your site (can Yandex do that?).
4. What is annoying about Google is its poor query language. It also searches for words literally (looking for "apple" will not find "apples"). What is cool about Yandex is that it has a morphology engine and a rich query syntax.
More ... (a collection of ideas on ranking will go here)
Date: 2001-06-10 09:56 am (UTC)
1. The text of the link itself (the anchor) can be used.
2. The use of "Back", "Home", and "Bookmarks" is not reflected in the random-walk model of browsing. The coefficient d oversimplifies all of this.
3. Expander-like properties of the graph and small diameter.
4. Customizing the search engine: simple ranking averages over all topics. Information in a user's Bookmarks could be used to bias the ranks towards that user's interests (e.g., via a non-uniform vector E, as in the sketch above).
5. Hubs vs. authorities: the current ranking only measures the authoritative side; is that correct?
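This is Kleinberg's hubs-and-authorities (HITS) idea: a good hub points to good authorities, and a good authority is pointed to by good hubs. A minimal sketch of the mutually reinforcing iteration, on a hypothetical three-page adjacency matrix:

    import numpy as np

    # A[i][j] = 1 iff page i links to page j (hypothetical graph).
    A = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]])

    hubs = np.ones(3)
    auths = np.ones(3)
    for _ in range(50):
        auths = A.T @ hubs              # authority: endorsed by good hubs
        hubs = A @ auths                # hub: points to good authorities
        auths /= np.linalg.norm(auths)  # normalize so scores stay bounded
        hubs /= np.linalg.norm(hubs)

    print("authorities:", auths)
    print("hubs:", hubs)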
6. Another useful reference, from WWW9:
An Efficient Algorithm to Rank Web Resources
Dell Zhang, Yisheng Dong (China)
The result of our rank algorithm, which synthesizes the relevance, authority, integrativity, and novelty of each Web resource, can be computed efficiently not by iteration but by solving a system of linear equations.
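The iteration-versus-linear-equations contrast also applies to the basic scheme above: since R = d*M*R + (1-d)*E is linear in R, for small graphs one can solve (I - d*M)*R = (1-d)*E directly instead of iterating. A sketch, assuming dangling pages have already been patched so that M is column-stochastic:

    import numpy as np

    def pagerank_direct(M, d=0.85, E=None):
        # Solve (I - d*M) R = (1-d)*E for the fixed point R.
        n = M.shape[0]
        if E is None:
            E = np.full(n, 1.0 / n)
        return np.linalg.solve(np.eye(n) - d * M, (1 - d) * E)

    M = np.array([[0.0, 1.0],
                  [1.0, 0.0]])      # two pages pointing at each other
    print(pagerank_direct(M))       # -> [0.5, 0.5]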
More ideas? Is anybody developing an open-source search engine? Show up.