Gather the knowledge necessary for Natural Logic inference (specialization/generalization relations from WordNet and Lore KB), or build a system that predicts the specialization/generalization relation between words, for example by training a simple classifier from data.
The format you write it in can be anything that is easy to access and fully captures the information. In general, we would probably want the relations as EL or ULF universal implicatives (e.g. (all x [[x dog.n] -> [x animal.n]])). If we wanted something more complicated, as we might see in the glosses of the words, we could still do that in a uniform way (e.g. (all x [[x run.v] -> [x (quickly.adv-a run.v)]])).
First, start with the WordNet hypernym relations. You may opt to simply write them into a tab-separated file of the relations.
It would be best to initially generate these files preserving the word sense in some way so we don't lose any information.
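As a rough sketch of what such a file and the corresponding implicatives could look like: the function names below are made up for illustration, the sample pairs are hand-written, and the "dog1.n" style of sense marking is only an assumption modelled on the Lore predicates described further down. The WordNet reader assumes NLTK and its WordNet corpus are installed; it is imported lazily so the rest runs without it.

```python
import io


def el_predicate(synset_name):
    """Turn a WordNet synset name such as 'dog.n.01' into 'dog1.n'.

    Keeping the sense number in the predicate is an assumption, chosen so
    that no word sense information is lost.
    """
    lemma, pos, num = synset_name.rsplit(".", 2)
    return f"{lemma}{int(num)}.{pos}"


def to_implicative(hyponym, hypernym):
    """Format one hypernym relation as a universal implicative string."""
    return f"(all x [[x {el_predicate(hyponym)}] -> [x {el_predicate(hypernym)}]])"


def write_tsv(pairs, out):
    """Write one 'hyponym<TAB>hypernym' relation per line."""
    for hyponym, hypernym in pairs:
        out.write(f"{hyponym}\t{hypernym}\n")


def wordnet_noun_pairs():
    """Yield (hyponym, hypernym) synset names for all WordNet nouns.

    Requires NLTK and its WordNet corpus; not called in the demo below.
    """
    from nltk.corpus import wordnet as wn

    for synset in wn.all_synsets(pos="n"):
        for hypernym in synset.hypernyms():
            yield synset.name(), hypernym.name()


# Hand-written sample pairs, so the demo runs without NLTK.
sample_pairs = [("dog.n.01", "canine.n.02"), ("canine.n.02", "carnivore.n.01")]
buffer = io.StringIO()
write_tsv(sample_pairs, buffer)
print(buffer.getvalue(), end="")
print(to_implicative(*sample_pairs[0]))
```

To dump the real relations you would pass wordnet_noun_pairs() and an open file to write_tsv instead of the sample list.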
The hypernym/hyponym relation is a very good place to start, but the information is incomplete, so there are other sources we can use to expand it. The Lore knowledge base contains general world knowledge axioms that we could use. These are extracted from text, so they're relevant and not necessarily lexical (so a good pairing with the WordNet info). The main issue with that source is that it's noisy, so we'll need to experiment with filtering/rule weighting to best use those axioms for inferences (you'll be able to run such experiments once I get the inference code working).
There are other knowledge bases as well that may have interesting information (e.g. the Paraphrase Database (PPDB), the NELL knowledge base, Cyc; I included PPDB because synonyms are also relevant for inferences, so it would be useful to get that information). You can look over those datasets and see if there's any relevant data in them that we can mine.
For the classifier, the most straightforward thing would be to train a system that takes a sentence, a word from the sentence, and a candidate word as inputs, and outputs the hypernym relation between the selected word and the candidate word, e.g.:
Input : Coco is a dog, dog [span 3,4], animal
Output: hyponym (between dog and animal)
Using a token span rather than a token index (e.g. using [span 3,4] rather than [index 3]) will help us include compound nouns such as 'university graduate'. So from 'Coco is a dog' we get Coco[0,1] and dog[3,4], and from 'Mary is a university graduate' we get university graduate[3,5].
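The span convention above (start index inclusive, end index exclusive, over whitespace tokens) can be sketched as follows. The function name find_spans is made up for illustration, and whitespace tokenization is an assumption; a real tokenizer may split differently.

```python
def find_spans(tokens, phrase_tokens):
    """Return every (start, end) span where phrase_tokens occurs in tokens.

    Spans are half-open: start is inclusive, end is exclusive.
    """
    n = len(phrase_tokens)
    if n == 0:
        return []
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase_tokens
    ]


print(find_spans("Coco is a dog".split(), ["dog"]))
print(find_spans("Mary is a university graduate".split(), ["university", "graduate"]))
```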
You can generate a dataset by taking sense-disambiguated sentences ([login to view URL] and [login to view URL]) and then using the hypernym relations from WordNet to annotate the sentences. With that dataset you can train a system. An SVM classifier would be the simplest to get running with decent results (extract some features such as the sentence words, and the words and characters surrounding the target word).
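A minimal sketch of the annotation step, assuming the hypernym relations have been loaded into a dictionary from each sense to its direct hypernyms. The hierarchy below is a shortened toy version, the function names are made up, and the label names other than "hyponym" (which follows the Input/Output example above) are assumptions.

```python
def ancestors(sense, hypernyms):
    """Collect all direct and indirect hypernyms of a sense."""
    seen = set()
    stack = [sense]
    while stack:
        for parent in hypernyms.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def relation(target, candidate, hypernyms):
    """Label the relation of the target sense to the candidate sense."""
    if candidate in ancestors(target, hypernyms):
        return "hyponym"   # target is more specific than candidate
    if target in ancestors(candidate, hypernyms):
        return "hypernym"  # target is more general than candidate
    return "none"


def make_example(tokens, span, target_sense, candidate_sense, hypernyms):
    """Build one training example from a sense-tagged target word."""
    return {
        "sentence": " ".join(tokens),
        "span": span,
        "candidate": candidate_sense,
        "label": relation(target_sense, candidate_sense, hypernyms),
    }


# Toy hierarchy for illustration only; the real one comes from WordNet.
TOY_HYPERNYMS = {
    "dog.n.01": ["canine.n.02"],
    "canine.n.02": ["carnivore.n.01"],
    "carnivore.n.01": ["animal.n.01"],
}

example = make_example(
    "Coco is a dog".split(), (3, 4), "dog.n.01", "animal.n.01", TOY_HYPERNYMS
)
print(example)
```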
[Notice that we're ignoring polarity, even though it's important. We're planning on handling polarity separately, so we don't have to worry about it here. We'll train the system to ignore polarity, but the sentence context is important to capture the word sense information].
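To make the feature idea concrete, here is one possible feature extractor for the sentence words plus the words and characters around the target span. The feature names, the window size, and the three-character context are all arbitrary choices for illustration. Dictionaries like these could, for instance, be fed to scikit-learn's DictVectorizer and a linear SVM, but that part is not shown here.

```python
def extract_features(tokens, span, candidate, window=2):
    """Return a sparse feature dict for one (sentence, span, candidate) input."""
    start, end = span
    feats = {}
    # Bag of sentence words, to give the classifier word sense context.
    for tok in tokens:
        feats[f"sent={tok.lower()}"] = 1
    feats["target=" + "_".join(tokens[start:end]).lower()] = 1
    feats[f"cand={candidate.lower()}"] = 1
    # Words within `window` positions to the left and right of the span.
    for off in range(1, window + 1):
        if start - off >= 0:
            feats[f"left{off}={tokens[start - off].lower()}"] = 1
        if end + off - 1 < len(tokens):
            feats[f"right{off}={tokens[end + off - 1].lower()}"] = 1
    # Up to three characters immediately before and after the span.
    left_text = " ".join(tokens[:start])
    right_text = " ".join(tokens[end:])
    if left_text:
        feats[f"lchars={left_text[-3:].lower()}"] = 1
    if right_text:
        feats[f"rchars={right_text[:3].lower()}"] = 1
    return feats


features = extract_features("Coco is a dog".split(), (3, 4), "animal")
print(sorted(features))
```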
Please download the Lore KB file:
[login to view URL]
The file is formatted so that each line is the symbol '(:s', then the weighting of the rule (I think it's the number of source sentences...), some metadata about the rule (pointers to source sentences in a separate corpus -- they're corrupted so you can basically ignore them), an EL formula, and an auto-generated English sentence that corresponds to the formula. My simplification script got lost, but here are a few Python snippets you can use to easily access the data you want.
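import re

# from_lispstr is not defined in these snippets; it presumably parses a
# Lisp-style string into a nested structure.
line = ...  # one line (str) read from the Lore KB file
KB_LINE_MATCH = r'\(:s (\d+) ([^()]+) (\(.+\)) "(.+)"'
m = re.match(KB_LINE_MATCH, line)
weight = int(m.group(1))            # rule weight
formula = from_lispstr(m.group(3))  # parsed EL formula
natural = m.group(4)                # auto-generated English sentence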
You can use them directly if you're using Python, or write them into a formatted file that's easy to read from whatever other language you're using.
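The snippets above call from_lispstr without defining it. A minimal stand-in might look like the sketch below, assuming the formulas contain only parentheses and whitespace-separated atoms (no quoted strings inside the formula). The names parse_sexpr and parse_kb_line are made up, and the sample line is invented to match the format described above; it is not taken from the Lore KB.

```python
import re

KB_LINE_MATCH = r'\(:s (\d+) ([^()]+) (\(.+\)) "(.+)"'


def parse_sexpr(text):
    """Parse one parenthesized expression into nested lists of string atoms."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    if len(stack[0]) != 1:
        raise ValueError("expected exactly one expression")
    return stack[0][0]


def parse_kb_line(line):
    """Return (weight, formula, english) for one KB line, or None if no match."""
    m = re.match(KB_LINE_MATCH, line)
    if m is None:
        return None
    return int(m.group(1)), parse_sexpr(m.group(3)), m.group(4)


# Invented sample line, for illustration only.
sample = '(:s 12 src-001 (all x (x dog.n) (x animal.n)) "A dog is an animal.")'
print(parse_kb_line(sample))
```

Returning None for lines that do not match makes it easy to count and inspect the lines the pattern misses before deciding how to handle them.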
I would recommend building your system by taking a small prefix subset to develop on -- 'head -n 100 [source filename] > [target filename]' will write the first 100 lines to [target filename]. The formulas are ordered by importance, so this way you'll handle the most important cases first.
Some of the formulas have operators marked by a leading colon (e.g. :f, :i, etc). ":i", ":f", ":p", ":q", and ":l" can be thought of as meaning, respectively, "sentence in infix form" (i.e., with the first argument preceding the predicate), "function application", "predicate application", "unscoped quantifier application", and "lambda abstraction". I would start by ignoring those formulas, but ideally you'll eventually handle these formats as well. They're an alternative representation of ULF that's a bit more explicit about the types (easy for Lisp processing) -- note that quantifiers aren't scoped in this representation, though they are in the other formulas. Also, the formulas are in full EL, not ULF, and the conventions are a bit different. Here is a list of the most obvious differences I noticed; please let me know if anything remains unclear about the logical forms:
- quantifiers have no suffix
- quantifiers are scoped and the variable quantified is explicit: format (<quantifier> <variable> <restrictor> <body>). The restrictor and body correspond to generalized quantifier theory, if you know about that it might be helpful. Otherwise, they have natural language-like interpretations. [not for formulas with colon operators]
- predicative statements are in infix notation
- '** e' marks the episode that is being characterized. Different 'e's (e.g. e1, e2, etc.) can be used to state relationships between episodes in EL. I *think* every formula in Lore only has a single such variable asserted though.
- the 'pair' operator relates an actor with an episode to make an action (or more precisely some relation between an agent and an episode), which can be talked about separate from the episode itself.
- many operators and predicates are formed from multiple lexical items (e.g. many-or-some, day_of_rest.v, have-as-part.v, etc.)
- some operators are marked by a word sense before the suffix (e.g. person1.n, have7.v). I'm pretty sure these are WordNet word senses.
- possession is marked with the 'have-as' operator (e.g. (x (have-as mother.n) y)). This roughly corresponds to (y = (the.d ((poss-by x) (mother-of.n *s)))) in the current ULF annotation guidelines. In fully disambiguated EL it's simply [y mother-of.n x].
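A few of the conventions listed above can be sketched in code, assuming formulas have already been parsed into nested Python lists of string atoms. All function names here are made up, the example formulas are invented rather than taken from Lore, and the sense-number pattern is a guess that would misread a lemma that genuinely ends in a digit.

```python
import re

# lemma, optional sense number, then a type suffix such as .n or .v
SENSE_RE = re.compile(r"^(.+?)(\d+)?\.([a-z-]+)$")


def has_colon_operator(formula):
    """True if any atom in the formula starts with a colon (:i, :f, ...)."""
    if isinstance(formula, str):
        return formula.startswith(":")
    return any(has_colon_operator(part) for part in formula)


def split_quantified(formula):
    """Split (<quantifier> <variable> <restrictor> <body>), else return None.

    Only the shape is checked; the quantifier is not compared to a known list.
    """
    if isinstance(formula, list) and len(formula) == 4:
        quantifier, variable, restrictor, body = formula
        if (isinstance(quantifier, str) and isinstance(variable, str)
                and isinstance(restrictor, list) and isinstance(body, list)):
            return {"quantifier": quantifier, "variable": variable,
                    "restrictor": restrictor, "body": body}
    return None


def split_sense(atom):
    """Split 'person1.n' into ('person', 1, 'n'); sense is None if absent."""
    m = SENSE_RE.match(atom)
    if m is None:
        return atom, None, None
    lemma, sense, suffix = m.groups()
    return lemma, int(sense) if sense else None, suffix


scoped = ["all", "x", ["x", "dog.n"], ["x", "animal.n"]]
unscoped = [":i", [":q", "det", "person1.n"], "have7.v", [":q", "det", "head.n"]]
print(has_colon_operator(scoped), has_colon_operator(unscoped))
print(split_quantified(scoped))
print(split_sense("person1.n"), split_sense("day_of_rest.v"))
```

Filtering with has_colon_operator first is one way to follow the suggestion of ignoring the colon-operator formulas initially.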
You can read the first 8 pages or so of https://www.cs.rochester.edu/~schubert/papers/el-meets-lrrh.pdf to understand disambiguated EL (maybe I've already sent this to you). It may not be necessary if the formulas make sense to you.
Also, here is a dataset of noun hierarchy axioms generated by the same guy who made Lore: http://www.cs.rochester.edu/research/epilog/wn/