Parserator – A family of probabilistic parsers (parserator.datamade.us)
I don't mean that as a criticism of the project itself, though. The demos alone are pretty cool, and the framework looks incredibly useful.
Parsing is commonly understood to infer a hierarchical structure, but that's not required.
Indeed, the Latin root leads us to understand that parsing is about breaking an object into parts. That these parts nest according to a particular grammar is something of an important implementation detail.
This might work better if you had a database of almost every street name and almost every place name. Then you could take in an address, and classify words as one or more of [StreetName, PlaceName, StreetType, etc.]. Some words can appear in more than one of those categories, which is when a deterministic parser without a full database fails. Then let the learning algorithm deal with ambiguities such as "1 Park Lane", "1 Lane Park", and such. You'd have a better chance of dealing with the hard cases. Expecting this to recognize street words on its own is a reach.
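To make the lookup step concrete, here is a minimal sketch of the first stage only: tagging each word with every category it could belong to. The three tiny sets are made-up stand-ins for the "database of almost every street name and place name", and `candidate_labels` and `tag` are hypothetical names, not part of Parserator.

```python
# Hypothetical mini-gazetteers; a real system would load these from a
# full street-name / place-name database.
STREET_NAMES = {"park", "lane", "green", "main"}
PLACE_NAMES = {"park", "springfield"}
STREET_TYPES = {"lane", "park", "street", "blvd"}


def candidate_labels(token):
    """Return every category the token could belong to."""
    word = token.lower().strip(",.")
    labels = set()
    if word.isdigit():
        labels.add("AddressNumber")
    if word in STREET_NAMES:
        labels.add("StreetName")
    if word in PLACE_NAMES:
        labels.add("PlaceName")
    if word in STREET_TYPES:
        labels.add("StreetType")
    return labels or {"Unknown"}


def tag(address):
    """Pair each whitespace-separated token with its candidate labels."""
    return [(tok, candidate_labels(tok)) for tok in address.split()]
```

For "1 Park Lane" and "1 Lane Park", both "Park" and "Lane" come back with more than one candidate label. Those label sets would then be handed to the learning algorithm as features, so it only has to resolve the ambiguity rather than discover street words from scratch.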
You can get about 95% successful parsing of US business addresses with a relatively simple parser that lacks a name database. (I have one running right now on 20 million addresses.) Then it gets hard. Are they doing better than that?
The commercial parsers with full address databases do much better.
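For a sense of what a simple parser without a name database can look like, here is a rough sketch. It is not the parser described above; the pattern, the short list of street types, and the name `parse_address` are all assumptions made for illustration.

```python
import re

# Assumed layout: "<number> <street> <type>, <city>, <ST> [zip]".
# The street-type list is deliberately short and hypothetical.
ADDRESS_RE = re.compile(
    r"^\s*(?P<number>\d+)\s+"
    r"(?P<street>.+?)\s+"
    r"(?P<type>St|Street|Ave|Avenue|Blvd|Rd|Road|Ln|Lane|Dr|Drive)\.?\s*,\s*"
    r"(?P<city>[^,]+?)\s*,\s*"
    r"(?P<state>[A-Z]{2})"
    r"(?:\s+(?P<zip>\d{5}(?:-\d{4})?))?\s*$",
    re.IGNORECASE,
)


def parse_address(text):
    """Return a dict of address fields, or None if the pattern does not fit."""
    match = ADDRESS_RE.match(text)
    return match.groupdict() if match else None
```

Something like `parse_address("123 Main St, Springfield, IL 62701")` fills in every field, while an address with no recognised street type, or with the comma after the street type missing, returns None. That is the kind of case where a pattern-only approach runs out and a name database or a learned model has to take over.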
"Duzbuns Hopsit pfarmerrsc"
123 1/2 Green Onion, Some City, CA
1/2 is treated as a number suffix (wrong: it is part of the number), and Onion is treated as StreetPost, i.e. a street-type classifier like Blvd. or Street.
Miss a comma, and the parser will be completely confused.
So, yay, now my address is going to be probabilistically wrong ;)
What do you need the grammatical parsing for? Identification of named entities?
Is three given names followed by a hyphenated family name.
Not a corporation.
I just made it up, but it is representative of a substantial subset of the English speaking population of the world.
Have you looked at Spanish naming conventions?
Without a substantial amount of context, I don't think such a parser can be made useful.
> Alex, GivenName
> Chamberlain, Surname
I'd be interested in following your progress. You can send me an email so we can connect.