That's a great way. Yes, brute force is slow, but simply generating all possible routes... seems, for now, to be the only way to handle the semantic relatedness problem.
Relating to my article at CP, I am currently more concerned with semantic similarity, which is a special case of semantic relatedness (as you have presented above). Semantic similarity only considers the IS-A (kind-of) relation (hyponymy/hypernymy for nouns and troponymy for verbs).
Computing the path distance between two words a and b means searching for a connection path between them in WordNet. This can be done by searching the paths from each sense of a to each sense of b and then selecting the shortest one. Path length is measured in nodes rather than links, so the length between siblings (sister nodes) is 3 and the length between two members of the same synset is 1.
For example, a hyponymy relation in WordNet:
Code:
conveyance, transport
  vehicle
    wheeled vehicle
      automotive, motor
        car, auto, ...
        truck
      bike, bicycle

ware
  table ware
    cutlery, eating utensil
      fork
Looking at this tree (sorry if it looks bad, it took me 5 minutes to draw), the length between "car" and "auto" is 1 because they both belong to the same synset, the length between "car" and "bike" is 4, and the length between "car" and "fork" is 12 (that path has to climb up through hypernyms above the fragments drawn here before coming back down to "fork").
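To make the node counting concrete, here is a small self-contained C# sketch: it does not touch WordNet.Net at all, just a hard-coded copy of the toy tree above with shortened labels, and it reproduces the lengths with a breadth-first search. The "car"/"fork" case cannot be checked here because their common ancestor lies above the fragments drawn.
Code:
using System;
using System.Collections.Generic;

class ToyPathLength
{
    // Undirected hypernym/hyponym edges of the toy tree above (one node per synset).
    static readonly Dictionary<string, List<string>> Edges = new Dictionary<string, List<string>>();

    static void Link(string a, string b)
    {
        if (!Edges.ContainsKey(a)) Edges[a] = new List<string>();
        if (!Edges.ContainsKey(b)) Edges[b] = new List<string>();
        Edges[a].Add(b);
        Edges[b].Add(a);
    }

    // Path length counted in nodes: number of nodes on the shortest path, endpoints included.
    static int PathLengthInNodes(string from, string to)
    {
        if (from == to) return 1;               // two members of the same synset -> length 1
        var dist = new Dictionary<string, int> { [from] = 1 };
        var queue = new Queue<string>();
        queue.Enqueue(from);
        while (queue.Count > 0)
        {
            string node = queue.Dequeue();
            foreach (string next in Edges[node])
            {
                if (dist.ContainsKey(next)) continue;
                dist[next] = dist[node] + 1;
                if (next == to) return dist[next];
                queue.Enqueue(next);
            }
        }
        return -1; // not connected in this toy graph
    }

    static void Main()
    {
        Link("conveyance", "vehicle");
        Link("vehicle", "wheeled vehicle");
        Link("wheeled vehicle", "automotive");
        Link("wheeled vehicle", "bike");
        Link("automotive", "car");   // "car" and "auto" share one synset, so one node here
        Link("automotive", "truck");

        Console.WriteLine(PathLengthInNodes("car", "car"));   // 1 (same synset, e.g. car/auto)
        Console.WriteLine(PathLengthInNodes("car", "truck")); // 3 (siblings)
        Console.WriteLine(PathLengthInNodes("car", "bike"));  // 4
    }
}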
Personally, I think the path length above gives us a simple way to compute the relatedness distance between two words. Some issues need to be addressed:
- Lemmatization: when looking up a word in WN, the word is first lemmatized, so the distance between "book" and "books" is 0 since they become identical. "Mice" and "mouse"? Lemmatization can be done using Morph.cs... I have not tried this with Morph.cs yet.
- The path length only compares words that have the same part of speech (POS). This means we don't compare a noun with a verb, because they are located in different taxonomy trees, and I only consider words that are nouns, verbs, or adjectives. We will use Jeff Martin's lexical library: when considering a word, we first check whether it is a noun; if so, we treat it as a noun and its verb or adjective senses are disregarded. If it is not a noun, we check whether it is a verb, and so on... (see the sketch after this list).
- Compound nouns: phrases like "travel agent" will be treated as a single word during tokenization.
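To make the noun-first rule from the second point explicit, here is a rough C# sketch of the intended lookup order. Lemmatize and HasSynsetsFor are placeholders I made up for whatever Morph.cs and Jeff Martin's lexical library actually expose, so treat them as assumptions rather than the real API.
Code:
using System;

enum Pos { Noun, Verb, Adjective, None }

class PosPriorityLookup
{
    // Placeholder for Morph.cs: reduce an inflected form to its base form ("books" -> "book").
    static string Lemmatize(string word, Pos pos) => word; // assumption: real code would call the morph routine

    // Placeholder for the lexical library: does WordNet contain any synset for this lemma/POS pair?
    static bool HasSynsetsFor(string lemma, Pos pos) => false; // assumption: real code would query WordNet

    // Pick a single POS per word: noun first, then verb, then adjective; anything else is skipped.
    static Pos ChoosePos(string word)
    {
        foreach (Pos pos in new[] { Pos.Noun, Pos.Verb, Pos.Adjective })
        {
            string lemma = Lemmatize(word, pos);
            if (HasSynsetsFor(lemma, pos))
                return pos;
        }
        return Pos.None; // the word is in none of the three taxonomies we compare
    }

    static void Main()
    {
        // Prints None with the stub placeholders; it would print Noun once they are wired to WordNet.
        Console.WriteLine(ChoosePos("mice"));
    }
}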
There are many measures that compute similarity based on path length, such as Leacock-Chodorow and Wu-Palmer (Resnik's measure instead uses information content). The path-length measures have the advantage of being independent of corpus statistics... but they are also not very successful.
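For reference, the two purely path-based measures can be written down in a couple of lines. This is just the textbook form of the formulas (the exact node-vs-link counting convention varies between implementations), not code taken from any particular library, and the numbers in Main are only illustrative.
Code:
using System;

class PathBasedMeasures
{
    // Leacock-Chodorow: -log( pathLength / (2 * maxDepth) ),
    // where maxDepth is the maximum depth of the taxonomy the two synsets live in.
    static double LeacockChodorow(int pathLength, int maxDepth)
        => -Math.Log((double)pathLength / (2 * maxDepth));

    // Wu-Palmer: 2 * depth(lcs) / (depth(s1) + depth(s2)),
    // where lcs is the lowest common subsumer (deepest shared hypernym) of the two synsets.
    static double WuPalmer(int depthLcs, int depthS1, int depthS2)
        => 2.0 * depthLcs / (depthS1 + depthS2);

    static void Main()
    {
        // Illustrative numbers only, not taken from WordNet.
        Console.WriteLine(LeacockChodorow(4, 16)); // car/bike with an assumed taxonomy depth of 16
        Console.WriteLine(WuPalmer(10, 12, 11));   // assumed depths for two fairly specific synsets
    }
}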
Besides the formula I proposed in the experiment of the article at CP, I think there is one simple path-based similarity measure:
Sim(s1, s2) = 1 / dist(s1, s2).
where s1 and s2 are synsets of the words a and b respectively, and dist(s1, s2) is the path length between s1 and s2.
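Plugging the lengths from the toy tree above into this measure gives a feel for its range: since dist is at least 1, Sim always lies in (0, 1]. A trivial C# helper:
Code:
using System;

class SimpleSim
{
    // Sim(s1, s2) = 1 / dist(s1, s2); dist is the node-counted path length, so it is always >= 1
    // and Sim therefore always falls in (0, 1].
    static double Sim(int dist) => 1.0 / dist;

    static void Main()
    {
        Console.WriteLine(Sim(1));  // car / auto -> 1.0 (same synset)
        Console.WriteLine(Sim(4));  // car / bike -> 0.25
        Console.WriteLine(Sim(12)); // car / fork -> ~0.083
    }
}
At the word level, taking the two senses with the shortest path (as described above) is the same as taking the maximum Sim over all sense pairs of a and b.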
Concerning just the semantic similarity, do you agree with me?