Proselint

Proselint (proselint.com)

398 points by g1n016399 10 years ago | 137 comments

IanCal 10 years ago |

This sounds interesting. As a bit of constructive criticism, please put some examples high up.

You tell me it does cool things. Great, show me. I've looked about on the various pages and can see only one example and I don't understand it:

    text.md:0:10: wallace.uncomparables Comparison of an uncomparable: 'unique' can not be compared.

What's the context of this, what's the error it would have caught in my writing?

The tool is in a perfect place to show this off as it's text.

suchow 10 years ago | |

Good idea. If you run `proselint` without specifying a document, it'll run on the demo text, which you can also access here: https://gist.github.com/suchow/c7856f21128aee89ad55. Also, there's a live demo available at: http://proselint.com/write. It's been tested only on the latest version of Chrome, and I doubt it will handle the load here, but give it a try.

Vivtek 10 years ago | |

This would catch something like "even more unique". In fact, looking at the code (https://github.com/amperser/proselint/blob/master/proselint/...) it would even catch something like "extremely unique", which I've been guilty of using.

But yes, there should be examples on the front page.
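To make the example concrete, here is a minimal sketch of how a check like this could work. It is not proselint's actual code: the word lists and the function name `find_uncomparables` are assumptions made up for illustration, and the real check is at the GitHub link above.

```python
import re

# Hypothetical word lists, for illustration only.
COMPARATORS = ["very", "more", "most", "extremely", "less", "least"]
UNCOMPARABLES = ["unique", "perfect", "absolute", "infinite"]

# A comparator, some whitespace, then an uncomparable adjective.
PATTERN = re.compile(
    r"\b(" + "|".join(COMPARATORS) + r")\s+(" + "|".join(UNCOMPARABLES) + r")\b",
    re.IGNORECASE,
)


def find_uncomparables(text):
    """Return (offset, message) pairs, one per comparator + uncomparable match."""
    return [
        (
            m.start(),
            "Comparison of an uncomparable: '{}' cannot be compared.".format(
                m.group(2).lower()
            ),
        )
        for m in PATTERN.finditer(text)
    ]
```

With these lists, "extremely unique" and "even more unique" are flagged, while a phrase such as "nearly unique" is not, because "nearly" is not in the comparator list.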

joosters 10 years ago | | |

So this program is like having some insufferable pedants arguing over your language? Great!

Does it accept 'nearly unique' ?

conceit 10 years ago | | |

So, I think English is an "analytical language", although I wouldn't know what it means, it inspires me to assume, that by analyzing the sentence you can make out that extremely could refer to something else than the uniqueness, isn't it? E.g., the phrase could mean, something was unique because of one of any number of extremes, unique by extremity. Sure, that should be uniquely extreme, but what I said shows something else. If there are different qualities that could be unique, wouldn't it make sense to quantify that? Of course, if a logic is so weak it cannot have the Peano axioms, you cannot advance beyond uniqueness. (What about missing all-quantor in propositional logic and uniqueness in predicate logic? I'm just stabbing in the dark, really.)

https://news.ycombinator.com/item?id=11239261

SixSigma 10 years ago | |

One cannot have gradations of uniqueness

https://www.youtube.com/watch?v=kdZtM3_Lcy4

conceit 10 years ago | | |

Imagine an object that is unique in exactly one respect, and another in two. Obviously, the other has more uniqueness. Now, if the first has precisely only the one characteristic of having not really any characteristic at all, then that's a totally different degree of uniqueness. So either is more unique in a specific respect. But arguably, the nothingness is most unique, if not the only really unique thing. So if we can ignore that, because no one in his right mind would talk about nothing, we can most of always readily conclude, that the first type of a collection of unique'ish things is concerned.

Is this regression ad absurdum or argumentum ad silencium?

shawabawa3 10 years ago | |

Some people feel you should never ever say things like "more unique", "most unique" etc

Which I think is as misguided as trying to force "data" to be plural, or insisting that "less than 3" is wrong

ScottBurson 10 years ago | | |

> Some people feel you should never ever say things like "more unique", "most unique" etc

I am among them. Here's why:

(1) There are already other words that express related concepts that are subject to gradation: "rare", "special", "unusual", and "extraordinary" come to mind.

(2) The original meaning of "unique", namely "one of a kind", is an important concept. If we let the word's meaning get lost, we will not be able to express that meaning as easily.

amelius 10 years ago | | |

In a mathematical context, something is either "unique" or it is not. There is no in-between state.

But you can easily define it to mean something else. And you can even make "uniqueness" comparable.

mbrock 10 years ago | | |

It's a linter, it's going to have some kind of "false positives." Maybe you could put an annotation that tells the linter you're sure that you mean it.

Semi-off-topic, but the notion of "more unique" reminds me of Sapolsky's TED talk about humans as the "uniquiest" animal.

https://www.ted.com/talks/robert_sapolsky_the_uniqueness_of_...
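For what it's worth, the annotation idea could look something like the sketch below. The marker string and the `drop_suppressed` helper are hypothetical and are not a proselint feature; the sketch only shows the general shape of filtering out warnings on lines the writer has explicitly marked.

```python
# Hypothetical marker; chosen so it stays invisible in rendered Markdown.
IGNORE_MARKER = "<!-- lint: ignore -->"


def drop_suppressed(text, warnings):
    """Drop warnings that point at lines carrying the ignore marker.

    `warnings` is a list of (line_number, message) pairs with 0-indexed lines.
    """
    lines = text.splitlines()
    return [
        (line_no, message)
        for line_no, message in warnings
        if IGNORE_MARKER not in lines[line_no]
    ]
```

A line ending in the marker keeps its "more unique", and every unmarked line is still reported.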

jonstokes 10 years ago |

I'm a writer and editor, and I dislike the idea of this tool quite a bit.

1. Writing isn't coding. In coding, you can do various types of "cargo cult programming" and "copypasta" and what-have-you -- in other words, as long as the code runs you don't necessarily have to know why or how a programming idiom or convention works, or how/why expressing it one way in code is better than expressing it another way in code. This is definitionally untrue of writing. If you don't know the why/how of something, then it's better for you to botch it and let the reader attempt to parse it so at least they know what they're dealing with and how to interpret it ("oh, this guy's a non-native speaker, so I'll adjust my reception accordingly" or "ah, this person is kind of clueless about the whole sexist language thing, which is good info for me.").

2. 90% of writing style advice falls into one of two categories: a) hotly debated, and b) totally wrong. Most of it is in the latter category, and this includes Strunk & White (just use google for numerous takedowns of that text). I looked through the PR queue and saw that it consists of eager coders finding style advice from various sources and trying to work that into the tool. That is terrible, terrible, terrible... This will guarantee that the tool will represent a collection of awful writing advice gleaned from dubious sources and wielded with unforgiving ignorance.

This tool may be a terrible idea, but the idea of automated prose linting is not terrible. Most beginner to intermediate writers have tics, and as an editor I often have a couple of writer-specific find/replace things I do when I get a new piece from a particular writer (e.g. "this person uses 'however' when she means 'but', and this person overuses these four business jargon terms", etc.). If editors were able to easily compose and execute writer-specific linters from within something like Wordpress, that would probably be pretty great.
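As a rough illustration of what a writer-specific linter might amount to, here is a small sketch. The writer names, the patterns, and the `lint_for_writer` function are all invented for the example; an editor would fill in the tics they actually see.

```python
import re

# Hypothetical per-writer rules: writer -> list of (pattern, editor's note).
WRITER_RULES = {
    "alice": [(r"\bhowever\b", "Check whether 'but' is meant.")],
    "bob": [(r"\b(synergy|leverage|paradigm|bandwidth)\b", "Business jargon.")],
}


def lint_for_writer(writer, text):
    """Return (offset, matched_text, note) for each of this writer's known tics."""
    hits = []
    for pattern, note in WRITER_RULES.get(writer, []):
        for m in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((m.start(), m.group(0), note))
    return sorted(hits)
```

The point of keeping the rules per writer is that nothing fires for a writer who does not have that particular habit, so the editor only sees notes they wrote themselves.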

But this particular command line tool is destined to be either totally unused or massively abused.

I'm sorry, I hate to be mean... or, actually, there is a small part of me that enjoys playing Mr. Party Pooper when I see a mob of enthusiastic programmers trying to tie down some great cultural Gulliver with a thousand tiny little automated, black-and-white rules.

rosser 10 years ago |

I can see a lot of value for this sort of tool, and might even play with it myself, for sake of evaluating whether or not to incorporate its suggestions into my writing. At the same time, however, I have some wariness that its widespread use could actually have a shaping, and, specifically homogenizing, effect on language. For me, a large part of the beauty of language is how facile it is, how judiciously breaking its rules can create a more artful and compelling means of expression than linted — if you will, "prosaic" — prose seems likely to offer.

dcw303 10 years ago |

This sounds promising, but I think a lot of potential users would be deterred by the lack of examples.

This positively screams for an online interface to test drive.

train_robber 10 years ago | |

http://proselint.com/write/

biturd 10 years ago | | |

Are you claiming you can paste in your own copy, and it will run against it? I see no text area in Chrome or Safari, what am I missing?

pron 10 years ago |

Probably a stupid nitpick, but this bothers me:

> detecting grammatical errors is AI-complete, requiring human-level intelligence to get things right.

(emphasis mine)

First, there's a problem of usage. When in CS we say that a problem is class-complete (like NP-complete), we mean that the problem belongs to the class (which in this case is true, because human-level intelligence can check grammar), but also that it is class-hard, which informally means "at least as hard as the hardest problems in the class", and more formally means that any other problem in the class can be cheaply reduced to the problem, and so finding a suitable solution to the problem is identical to finding a suitable solution to all other problems in the class. So not only is checking grammar not known to be "AI-complete"; we don't even know that human-level intelligence is necessary to solve it.

But the reason this bothers me even though I fully understand the statement was made informally, is a little deeper than that: we don't even know what "human-level intelligence" (or intelligence in general) is, let alone what AI means. That people refer to AI as if it's a thing rather than a very vague notion, clouds how people think of AI research as well as intelligence. I would have simply said "we don't know of good algorithms to dependably check grammar, and this appears to be a very hard problem that may require intelligence".

MichaelBurge 10 years ago |

If you're on Ubuntu, you want to run 'pip3 install proselint' rather than 'pip install proselint'.

I ran it on a couple 800 word emails and it didn't catch anything except me using 2 spaces instead of 1 in one place. I also ran it on my city's sidewalk maintenance ordinance, and it didn't report anything.

mdpacer 10 years ago | |

Part of the goals of proselint is to minimize the number of false positives that traditionally clutter the results of style checkers, resulting in users ignoring the changes when they see them. We want to be reasonably certain before raising an alarm. You can read more about the precise metric[^fn1] we use here: http://proselint.com/lintscore/.

And yes, `python3` for the win. :)

[^fn1]: If you wanted to be truly precise, it's a parametric family of metrics.

czechdeveloper 10 years ago |

Does anyone know of a similar tool for scientific papers? Specifically, one that helps non-native English speakers write high-quality scientific papers?

rodion 10 years ago | |

Something along these lines:

http://matt.might.net/articles/shell-scripts-for-passive-voi...

https://github.com/bnbeckwith/writegood-mode

MatthewWilkes 10 years ago |

While the idea is interesting, I do worry about the proliferation of linting to prose. Especially the hint about authoritative near the end of the article. Linters turn guidelines into steadfast rules in programming, removing all ability to use judgement if you want your PR merged. I personally want less of that, not more.

pablasso 10 years ago | |

How is standardization a bad thing in programming? In prose I can see the argument, but in programming you should always aim for standardization for code maintenance.

MatthewWilkes 10 years ago | | |

For example, the Python best practices document recommends 1 blank line around methods and 2 around top-level functions and classes. Linters enforce this. However, this can be a detriment to readability in some cases, such as closures or classes that have no body, only superclasses.

Some might say you can mark lines as not being linted, but that then makes the change vulnerable to bikeshedding. For some people, being able to force the conversation to not happen because the linter is authoritative might be good. Personally, I prefer to follow the guidelines while staying aware that they are there to aid understanding for future coders, not to enforce adherence to a standard.

kbenson 10 years ago |

Ah, another part of my brain I can offload to an external source. It will be interesting when we get to "social-lint", so those of us that are no good at social interactions (through lack of ability or lack of willingness to spend the effort to combat that) or that feel they spend far too much brainpower on social interactions to make up for lack of natural ability can benefit.

yitchelle 10 years ago |

Can someone explain in layman's terms how this is any better than an app like the Hemingway Editor [0]? Both analyze the text and make suggestions to improve it.

[0]- http://www.hemingwayapp.com/

suchow 10 years ago | |

See our discussion of this at http://proselint.com/approach/. I'll note that we do not consider Proselint a complete product — it's in its earliest stages, perhaps at 2% of its final capacity. That number has steadily decreased as we learn more, which we take to be a good sign.

hk__2 10 years ago | |

Hemingway is an editor while Proselint is a tool. The latter can be integrated in any editor. That’s the main reason I ditched Hemingway (the editor) because I couldn’t just copy/paste text in it to get some suggestions.

banach 10 years ago | | |

In what way were you not able to copy/paste into Hemingway to get suggestions?

squimmy 10 years ago |

I question how useful a tool like this is for a skilled writer.

Prose isn't code.

Many key elements of good writing are based around the idea of knowing the rules, and then carefully breaking them.

vpontis 10 years ago |

Can someone who has tried this share their experience?

It sounds really awesome but it's very hard to tell if it's going to be more annoying or more useful. Maybe it would be useful to have some example linting errors on the homepage.

Either way, I really love the idea!

vpontis 10 years ago | |

Hmm, I tried it out. Doesn't seem too useful yet and there is some polishing to be done so hopefully this continues to go through further development!

One needed improvement: display the offending line on errors. Then you don't have to toggle between file and console to contextualize the errors.
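Something like the following sketch is what I have in mind. It assumes 0-indexed line and column numbers, and `show_context` is a made-up helper name, not part of proselint.

```python
def show_context(text, line_no, col, message):
    """Format a warning with the offending line and a caret under the column.

    Assumes 0-indexed line and column numbers.
    """
    line = text.splitlines()[line_no]
    caret = " " * col + "^"
    return "{}:{}: {}\n{}\n{}".format(line_no, col, message, line, caret)
```

Printing the result gives the warning, the source line, and a caret marking the position, so there is no need to switch back to the file.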

gsabo 10 years ago | |

I ran some of my recent emails through it. It picked up my overuse of exclamation marks and my use of "all of the time" instead of "all the time." It definitely doesn't seem too sensitive - I would lint all of my emails with it if it were easy to do so.

stared 10 years ago |

Is it already in Atom or Sublime Text?

EDIT: I must be blind - they mention an ST plugin (although they don't link to it). https://packagecontrol.io/packages/SublimeLinter-contrib-pro...

vikeri 10 years ago | |

"There’s a plugin for Sublime Text." Didn't see anything about Atom though.

synthmeat 10 years ago |

Here's a suggestion...

Have copy on web site be intentionally incorrect, red-underlined with (small modals? tooltips?) that show what's been corrected/suggested by the tool.

aroberge 10 years ago | |

Like http://proselint.com/write/ ? ... which is also editable

gepoch 10 years ago |

See also write-good: https://github.com/btford/write-good

ayushgta 10 years ago | |

Gitbook has open sourced their proofreader at https://github.com/GitbookIO/rousseau

nmstoker 10 years ago |

Looks really interesting. I'd done some preliminary investigation into whether this kind of concept might work for the style guide at my company, but I never got time to take it further.

Is there any word on business model / the intentions of the developers? Is it something that's being open sourced and then integration assistance would be commercialised?

kmfrk 10 years ago |

This is very cool and needed, thank you.

Could you include a sample .proselintrc? rc files tend to have very different opinions on how to be formatted: dictionaries, JSON, bash-argument syntax, and so on. (EDIT: Ah, found one: https://github.com/amperser/proselint/blob/cd428bb0ecc5530c1.... Can’t quite get it to ignore butterick, though.)

I find it a little curious that you use a Markdown example and lint for curly quotes and Unicode ellipses by default (butterick), since Markdown discourages such pre-formatting in its syntax, but that’s just hairsplitting, of which, judging by your swelling Issues count, you have plenty as it is. :)

Looking forward to some formatting/syntax highlighting in the CLI output, but I know you have your hands full as it is.

joncp 10 years ago |

Tried it with "I'm better then you" and it didn't complain.

Nice idea, but you need to catch homophone errors.
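A crude version of a then/than check might look like the sketch below. The list of comparatives and the name `find_then_for_than` are my own assumptions; the list is kept explicit on purpose, because matching any word ending in "-er" would also flag ordinary sentences like "We ate dinner then left."

```python
import re

# Hypothetical, deliberately short list of comparatives that are usually
# followed by "than", not "then".
COMPARATIVES = [
    "better", "worse", "more", "less", "fewer",
    "greater", "larger", "smaller", "rather", "other",
]

THEN_FOR_THAN = re.compile(
    r"\b(?:" + "|".join(COMPARATIVES) + r")\s+(then)\b", re.IGNORECASE
)


def find_then_for_than(text):
    """Return the offsets of each 'then' that directly follows a comparative."""
    return [m.start(1) for m in THEN_FOR_THAN.finditer(text)]
```

This would still misfire on sentences such as "I did more, then stopped" if the comma were missing, which is the kind of case that needs real sentence structure to get right.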

raphman 10 years ago |

Are there any plans to support rules for texts written in other languages (e.g., German)? Would a set of such rules fit within the scope of this project or is proselint purposely or inherently limited to English prose? (@suchow)

suchow 10 years ago | |

It's out of scope for now, but only because we don't have any native speakers of other languages helping us out with the project, and this stuff is hard enough to get write in your native tongue; otherwise it's on the table. Interested?

raphman 10 years ago | | |

I'd certainly contribute a few rules for German prose. Actually, I'm even more interested in using proselint with custom rules for theater plays (e.g., check for unnecessary repetitions, word combinations that are (acoustically) hard to understand).

As czechdeveloper has pointed out in this thread, it would also be nice to have a set of rules specifically for academic writing and/or for non-native speakers (e.g., Asian scientists seem prone to overuse "the").

I guess a first step would be to have an extensible set of tags for the rules - both language-specifying ones (i.e., any_language, american_english, british_english, german, ...) and genre-specifying ones (any_genre, prose, poetry, academic, technical, ...). Furthermore, an easy way to select a subset of rules by tag (e.g., british_english and academic) would be necessary.

Would that fit within your goals for proselint?
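To sketch what I mean by selecting rules by tag: the rule names and the `select_rules` function below are hypothetical, and only the tag names come from my list above. Each rule carries a set of language tags and a set of genre tags, and the wildcard tags match everything.

```python
# Hypothetical registry: rule name -> (language tags, genre tags).
RULES = {
    "uncomparables": ({"american_english", "british_english"}, {"any_genre"}),
    "ise_spellings": ({"british_english"}, {"any_genre"}),
    "overused_the": ({"american_english", "british_english"}, {"academic"}),
    "compound_nouns": ({"german"}, {"prose"}),
    "double_spaces": ({"any_language"}, {"any_genre"}),
}


def select_rules(language, genre):
    """Return the sorted names of rules that apply to this language and genre."""
    selected = []
    for name, (languages, genres) in RULES.items():
        language_ok = language in languages or "any_language" in languages
        genre_ok = genre in genres or "any_genre" in genres
        if language_ok and genre_ok:
            selected.append(name)
    return sorted(selected)
```

Selecting british_english and academic would then pick up the British-only and academic-only rules plus everything tagged with a wildcard, and skip the German ones.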

Singletoned 10 years ago | | |

> this stuff is hard enough to get write in your native tongue

Was that deliberate?

segphault 10 years ago |

The main problem with a tool like this is that it needs to understand sentence structure in order to find a lot of common anti-patterns. Without some natural language processing, it's just going to be able to scan for word usage and simple things that you can catch with a regex. You could probably build something a lot more sophisticated on top of something like Apple's NSLinguisticTagger and related APIs.

After testing this against a dozen of my blog posts, I'm not terribly impressed with the output. I get more immediate value out of MarkedApp's keyword drawer and word repetition visualization.

suchow 10 years ago | |

You're right, but the problem is much worse than that. Examining 200 entries from Garner's Modern American Usage at random reveals that half of them are easy to implement, the kind of thing that could be assigned as a homework problem (e.g., recognizing that “$10 USD” is redundant, that “very unique” is comparing an uncomparable adjective, or that people from Michigan are called “Michiganders”, not “Michiganites”). Thirty percent are moderately challenging, requiring a week’s effort. Fifteen percent are hard: they are entire projects, requiring advances in AI. And the remaining advice (around five percent), the best kind, is AI-complete. Consider, e.g., "John hit Peter only in the nose". Does this mean that, of all Peter's body parts that could have been hit, John hit only Peter's nose? Or is it a grammatical error that was supposed to convey that, of all the people John could have hit, it was only Peter who he did hit?

We're interested in incorporating deeper NLP. In particular, we've been eyeing https://github.com/spacy-io/spaCy.
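To give a feel for the "homework problem" tier, here is a rough sketch of two of the easy cases mentioned above. It is illustrative only and is not how proselint implements them: the patterns, messages, and the `run_checks` name are assumptions.

```python
import re

# Hypothetical (pattern, message) pairs for two of the easy cases.
CHECKS = [
    (re.compile(r"\$\d[\d,.]*\s*USD\b"), "Redundant: '$' and 'USD' say the same thing."),
    (re.compile(r"\bMichiganites?\b"), "Preferred form: 'Michigander'."),
]


def run_checks(text):
    """Return sorted (offset, message) pairs for every match of every check."""
    hits = []
    for pattern, message in CHECKS:
        for m in pattern.finditer(text):
            hits.append((m.start(), message))
    return sorted(hits)
```

Each of these is a single regular expression; the "John hit Peter only in the nose" case has no comparable pattern, which is the gap between the easy tier and the hard one.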

gansai 10 years ago |

Will this be used by automated content creators? For example, lots of articles on some news websites (including Wikipedia) are written by bots. So the bot would write an article, invoke proselint, and correct the text if required?

kaeluka 10 years ago |

Related: artbollocks-mode https://github.com/sachac/artbollocks-mode

vortico 10 years ago |

I was skeptical that it would only detect obvious issues, but the sheer number of built-in checks is surprising. I'll try this on the next large text I write.

jake-low 10 years ago |

I've been interested in linters and style checkers for English prose for a while, and I'm excited to try this out!

To the author(s): Your website, as far as I could tell, doesn't tell me how to install it; I had to go to GitHub to realize it was pip-installable. You should consider adding that to the main page.

chei0aiV 10 years ago | |

The authors probably aren't reading HN, best submit a PR.

suchow 10 years ago | | |

We are. Even so, opening issues on Github and submitting PRs is appreciated.

kylemathews 10 years ago |

Nice idea.

Bug report — it told me I had too many exclamation marks in a Markdown file with a number of images in it.

Tepix 10 years ago | |

Sounds like a feature request ("recognize and support markdown"). Open an issue at https://github.com/amperser/proselint/issues/new
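In the meantime, one possible workaround is to strip image syntax before counting, roughly as in this sketch. The `count_exclamations` helper is hypothetical and the pattern only covers the simple inline form `![alt](url)`, not reference-style images.

```python
import re

# Matches inline Markdown images such as ![alt text](path/to/image.png).
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def count_exclamations(markdown_text):
    """Count '!' characters, ignoring the ones that open Markdown images."""
    return MD_IMAGE.sub("", markdown_text).count("!")
```

Note that deleting the images shifts character offsets, so a real fix inside the linter would need to keep positions intact, for example by replacing the matched span with spaces of the same length.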

timlyo 10 years ago |

Going through the example, it comes up with:

> Get that off of me before I catch on fire! > Needless variant. 'catch fire' is the preferred form

I don't think I've ever heard anyone say "catch fire" rather than "catch on fire".

From the UK if that changes anything.

reikonomusha 10 years ago | |

"To catch fire" is a relatively common term, at least in the USA. "To catch on fire" probably equally so.

vram22 10 years ago |

Ha ha, slightly related fun snippet I wrote:

http://jugad2.blogspot.in/2015/07/cut-crap-absolutely-essent...

edwinyzh 10 years ago |

Very interesting, and I'm looking into integrating it to http://WritingOutliner.com (or as a separate Word addin) :)

Dowwie 10 years ago |

Thank you for working on this project and sharing it.

One of the more challenging sections in the GMAT entails sentence correction. A proselint-enabled GMAT prep for sentence correction would be very valuable.

amelius 10 years ago |

What kinds of NLP technique does this system use?

Is it possible to specify new rules in a high-level way?

Can it learn from examples?

Does it work on a sentence-by-sentence basis only, or does it "grasp" complete paragraphs?

raphman 10 years ago | |

Rules are defined in Python scripts which can have arbitrary complexity. However, it seems like most rules are just string or regex matching:

https://github.com/amperser/proselint/blob/master/proselint/...

mdpacer 10 years ago | |

> What kinds of NLP technique does this system use?

It depends on your interpretation of NLP. In a sense, all of the rules are hard coded, and so it does string token processing that happens to be informed by contributed interpretations of style guides' rules for usage. Thus, most of the NLP has been performed by the human programmers interpreting those rules.

Though we are interested in extensions in the direction of robust machine NLP approaches able to meet the other goals of proselint, that presents many challenges (including some I mention in response to your third question). Nonetheless, this is an active area of research.

> Is it possible to specify new rules in a high-level way?

In short, no, but it is an area of active research on our part to develop a rule-templating engine for exactly this purpose. "High-level" is subjective though, so there may always be someone who intends to ask about a level higher than the interface that we provide at the time that this question is asked.

> Can it learn from examples?

In a sense, yes, all of the rules have been learned by people from the example text in guides and translated to linting rules. But I do not think that was your intended question.

If instead you mean: you would provide it a set of examples of your writing and it would induce a rule, no it does not do that currently, and may not for quite some time.

Stylistic rule induction is a difficult – though interesting – problem (as is rule induction more generally). It is not something we are intrinsically opposed to, but the simplest version of learning from examples would violate two core principles of the design of proselint.

First, our rules are taken from and organised around the advice provided by respected authors in their writing on linguistic style.

Second, any inductive method will be intrinsically uncertain about the rules that it induces. This uncertainty will always be opposed to our aim of having a low false alarm rate, making inductive methods possible but subject to extensive tuning and testing. This suggests that further development of a test set outside of the examples provided would be needed, to ensure coverage of any of the rules that the examples would suggest inducing.

Additionally, almost all state-of-the-art machine learning systems would require a set of relevant labeled examples of usage errors and non-errors that would somehow generalise to the examples that you would like to provide it. Even specifying the data format would be difficult; if you have any insights as to how this would be done, please develop them below, it can only be helpful and aid progress in this direction.

> Does it work on a sentence-by-sentence basis only, or does it "grasp" complete paragraphs?

I think the easiest way for you to answer this question is for you to see it in action at this website: http://proselint.com/write/

I should mention that longer range dependencies require greater computational power which brushes up against another aim of proselint, to be fast enough to run on reasonably large files as a real-time linter. This may not always be the case in all instantiations of proselint, but for now this is true.

If you have paragraph level rules that you might want to suggest (like the issue I just created when writing this response: https://github.com/amperser/proselint/issues/310), please do! It is even more helpful if you can find an authoritative reference to include as part of your issue, because that will be needed to incorporate the rule into proselint.

jcoffland 10 years ago |

It would be interesting to run this against campaign speeches as an unbiased way of judging the quality of prose. Surely content is more important, but still it would be fun.

brudgers 10 years ago |

Github: https://github.com/amperser/proselint/

willvarfar 10 years ago |

It's a Python module? I'm looking forward to making a Pelican plugin so my mate can start checking his blog for glaring errors before he posts! :)

true_religion 10 years ago |

I'm curious: is this just a grammar checker? Or does it do spell checking too, like aspell?

oneeyedpigeon 10 years ago | |

No. Yes, and no. http://proselint.com/approach/

zimpenfish 10 years ago |

Most important question - How many linguists are on the team developing this?

erubin 10 years ago |

Can I use this with LaTeX?

jake-low 10 years ago | |

Just tried it; you can. Seems like it strips markup characters so it should work well with most markup languages.

biturd 10 years ago |

FYI, seems to work perfectly fine in Safari on Mac OS X Desktop.

stared 10 years ago |

What is wrong with "very smart"? (line 86)

uncletaco 10 years ago | |

"avoid using the word ‘very’ because it’s lazy. A man is not very tired, he is exhausted. Don’t use very sad, use morose. Language was invented for one reason, boys - to woo women - and, in that endeavor, laziness will not do." - Dead Poets Society

busyant 10 years ago | |

I worked w/ a guy who was good at editing my manuscripts. His opinion (which I agree with) was that the word "very" was almost always superfluous. You can delete it without affecting your message.

conceit 10 years ago | | |

Took me a while to see that very comes from veritas and doesn't mean much. At first I wrongly thought, I knew what the word means. Now I do know verily.

blt 10 years ago |

Microsoft Word had something like this round about 1999

Piskvorrr 10 years ago | |

Yeah, there's the squiggly line; same thing, right?

Similarly, where Tesla Model S is concerned: Ford Motor Company had something like this round about 1908. (Where "something like this" is "has four wheels and no horses")