I've looked at several existing NLP frameworks (Open NLP, Stanford NLP) and none of them are accurate enough -- they fail on things like adjectives and old English second person pronouns. This makes them practically unusable for proper sense disambiguation, lemma and part of speech based rules, etc.
The Open NLP tokenizer is also terrible at tokenizing title abbreviations ("Dr", etc.) and things like the use of "--" to delimit text, which is frequently found in various Project Gutenberg texts. You can train the Open NLP tokenizer, but it only works on what it has seen, so you need to give it every variation of "(Mr|Mrs|Miss|Ms|Rev|Dr|...). [A-Z]" for it to tokenize those titles; the same goes for other tokens.
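For comparison, here is a minimal rule-based sketch of the behaviour described above, in plain Python. The title list, the pattern and the function name `tokenize` are my own assumptions for illustration; this is not how Open NLP works internally.

```python
import re

# Hypothetical list of title abbreviations; extend as needed.
TITLES = r"(?:Mr|Mrs|Miss|Ms|Rev|Dr)"

TOKEN_RE = re.compile(
    rf"""
    {TITLES}\.(?=\s+[A-Z])   # title abbreviation followed by a capitalised word
    | --                     # double hyphen used as a dash
    | \w+(?:'\w+)?           # words, with an optional internal apostrophe
    | [^\w\s]                # any other single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return a list of tokens, keeping 'Dr.' whole and '--' as one token."""
    return TOKEN_RE.findall(text)

print(tokenize("Dr. Seward wrote--in haste--to Mr. Harker."))
```

The title alternative is listed first so that it wins over the plain word rule, and the lookahead keeps a sentence-final "Dr." from swallowing the full stop.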
I find it substantially better than other tools as a PoS tagger.
Also worth noting that your assertion that you need these features to classify genres isn't obviously true to me at all.
For detecting uses of nouns like werewolf/werewolves, or vampire/vampires, I at least need the lemma to avoid writing different cases or a regex for each noun. Likewise, lemmatization can be used to handle different spellings (e.g. vampyre, or were-wolf). Similarly for verbs.
Lemmatization works best when it is coupled with part of speech tagging, so you avoid removing the -ing in adverbs for example.
Part of speech tagging also helps avoid incorrect labeling, such as not tagging 'bit' in "a bit is a single binary value" as the verb "to bite".
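A toy sketch of why the lemma lookup wants the part of speech as part of the key. The table entries and the coarse tag names (NOUN, VERB, ...) are assumptions made up for this example, not a real lexicon, and the tagged input is written by hand rather than produced by a tagger.

```python
# Toy lemma table keyed by (lowercased token, coarse POS tag).
# The entries and tag names are illustrative assumptions, not a real lexicon.
LEMMAS = {
    ("werewolves", "NOUN"): "werewolf",
    ("were-wolf", "NOUN"): "werewolf",
    ("vampires", "NOUN"): "vampire",
    ("vampyre", "NOUN"): "vampire",
    ("bit", "VERB"): "bite",
    ("bitten", "VERB"): "bite",
    ("bit", "NOUN"): "bit",
}

def lemmatize(token, pos):
    """Look up the lemma for a (token, POS) pair, falling back to the token."""
    return LEMMAS.get((token.lower(), pos), token.lower())

def count_lemma(tagged_tokens, lemma, pos):
    """Count tokens whose POS matches and whose lemma equals `lemma`."""
    return sum(1 for tok, tag in tagged_tokens
               if tag == pos and lemmatize(tok, tag) == lemma)

# Hand-tagged tokens standing in for tagger output.
tagged = [("The", "DET"), ("vampyre", "NOUN"), ("bit", "VERB"), ("him", "PRON"),
          ("a", "DET"), ("bit", "NOUN"), ("is", "AUX"), ("binary", "ADJ")]
print(count_lemma(tagged, "bite", "VERB"))
print(count_lemma(tagged, "vampire", "NOUN"))
```

Only the verbal "bit" counts towards "bite", and the variant spelling "vampyre" is folded into "vampire" without a separate rule.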
That's for the simple case.
Then there are more complex cases, like generalizing "[NP] was bitten by the vampire.", where NP can be a personal pronoun (he, she, etc.) or a name. There can also be other ways to say the same thing, e.g. "The vampire bit [NP] neck." where NP is now the possessive form (his, her, etc.), not the subject form. With Universal Dependencies or similar style dependency relations, you could match and label sentence fragments of the form "verb=bite, nsubj:pass=NP, obl:agent=vampire" (like in the first sentence) and "verb=bite, nsubj=vampire, obj=NP" (like in the second sentence).
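To make the matching idea concrete, here is a rough sketch over hand-written parses. The triple format and the function names are hypothetical, and the relation labels are my reading of Universal Dependencies (passive subject as nsubj:pass, the "by" agent as obl:agent, the possessor hanging off "neck" as nmod:poss); a real parser may label things differently.

```python
# Each parse is a list of (head_lemma, relation, dependent_lemma) triples.
# These parses are written by hand for illustration; a real parser's labels may differ.
PASSIVE = [("bite", "nsubj:pass", "Lucy"), ("bite", "obl:agent", "vampire")]
ACTIVE = [("bite", "nsubj", "vampire"), ("bite", "obj", "neck"),
          ("neck", "nmod:poss", "her")]

def dependents(triples, head, relation):
    """Return all dependents attached to `head` by `relation`."""
    return [dep for h, rel, dep in triples if h == head and rel == relation]

def vampire_bite_victim(triples):
    """Return the bitten party if the parse describes a vampire biting someone, else None."""
    # Passive: "[NP] was bitten by the vampire."
    if "vampire" in dependents(triples, "bite", "obl:agent"):
        victims = dependents(triples, "bite", "nsubj:pass")
        return victims[0] if victims else None
    # Active: "The vampire bit [NP] neck."
    if "vampire" in dependents(triples, "bite", "nsubj"):
        for obj in dependents(triples, "bite", "obj"):
            # Prefer the possessor of the object ("her" in "her neck") when present.
            owners = dependents(triples, obj, "nmod:poss")
            return owners[0] if owners else obj
    return None

print(vampire_bite_victim(PASSIVE))
print(vampire_bite_victim(ACTIVE))
```

Both surface forms end up in the same place, which is the point: the rule is written once against the relations rather than once per word order.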
Without NLP, it becomes even harder to detect split variants like "cut off his head" and "cut his head off", which are the same thing written in different ways. I want to detect things like that and label the entire fragment "beheading", including other noun phrase variants.
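As a rough sketch of labelling both word orders with one rule, assuming lemmas are available. The tiny `LEMMA` table, the `max_span` window size and the function names are made up for this example, and a real version would want the parse to confirm that "head" is actually the object of "cut".

```python
# Hypothetical lemma lookup; in practice this would come from a lemmatizer.
LEMMA = {"cuts": "cut", "cutting": "cut", "heads": "head"}

def lemmas(text):
    """Lowercase, split on whitespace, strip punctuation and map to lemmas."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return [LEMMA.get(w, w) for w in words if w]

def find_beheading(text, max_span=6):
    """Return (start, end) token indices of a 'cut' + 'off' + 'head' fragment, or None."""
    toks = lemmas(text)
    for i, tok in enumerate(toks):
        if tok != "cut":
            continue
        window = toks[i:i + max_span]
        # Both the particle and the object must occur within the window, in either order.
        if "off" in window and "head" in window:
            end = i + max(window.index("off"), window.index("head"))
            return (i, end)
    return None

print(find_beheading("He cut off his head."))
print(find_beheading("He cut his head off."))
```

The returned span covers the whole fragment in both variants, so it can be labelled "beheading" as a unit.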
With more advanced NLP features -- like coreference resolution (resolving instances of he/she/etc. to the same person), and information extraction (e.g. Dracula is a vampire) -- it would be possible to tag even more sentences and sentence fragments.
https://en.wikipedia.org/wiki/English_possessive#Nouns_and_n...
The other issue is that, if you do focus on LLMs, the area is so hyped that your research would overlap or compete with too much other work, especially as you've got a dissertation to write. It's a hard problem.
If you are in the field of Information and Communication Technology (ICT), there is hardly any area of the field whose fundamentals Shannon did not have a hand in.
Leonard Kleinrock once remarked that he had to focus on the then-exotic field of queuing theory, which later led to packet switching and then the Internet, because most of the fundamental problems in electrical and computer engineering (the older version of ICT) had already been solved by Shannon.
There are plenty of research directions outlined in this document that don't require a huge compute budget.
Basically, because of the slow pace of review and publication, the letters column became a way to talk about recent results or problems, and then follow-up letters (i.e. comments on the blog posts) became common. So the editors decided to hive it off and speed up its publication schedule.
arXiv is vital for quickly developing research fields.
I guess the PhD student is indeed grammatically singular then. It can still refer to a PhD student in general, though, instead of a particular one.
https://www.rit.edu/ntid/sea/processes/articles/grammatical/...
But I did a lot of work on this type of thing, and the only time I found this sentence analysis approach useful as a source of classifier features was in a legal context, where there were variants of very specific language we wanted to find.
There it worked because we could write rules on the features without relying on training data.
Tf-idf on ngrams using a rolling window would certainly work to detect the beheading variants you gave as examples.
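A minimal sketch of what that could look like in plain Python, with no parsing at all. The function names, the tf * log(N / df) weighting and the three toy documents are my own choices for illustration; a library vectorizer would normally do this job.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """Yield every contiguous n-gram of length 1..n_max as a tuple (a rolling window)."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def tfidf(docs, n_max=3):
    """Return one {ngram: weight} dict per document using tf * log(N / df)."""
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    # Document frequency: iterating a Counter yields each distinct n-gram once.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({g: (k / total) * math.log(n_docs / df[g]) for g, k in c.items()})
    return weights

docs = [
    "he cut off his head",
    "he cut his head off",
    "he cut the bread",
]
w = tfidf(docs)
print(w[0][("his", "head")], w[1][("his", "head")])
```

In this toy corpus the bigram ("his", "head") gets a non-zero weight in both beheading variants and is absent from the third document, while ("he", "cut") appears everywhere and is weighted to zero.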
Again: try without the parsing features. There's a good reason they are rarely used in classifiers: they are too unreliable to improve performance over simple approaches.
>I also want to be able to assess how much of the text is about a given topic, so that if I'm interested in reading a detective story from e.g. the Project Gutenberg collection, I don't want it to pick up a story where a detective is only mentioned in one paragraph.
This is more like an NLU than an NLP problem, isn't it? It's like tracking how much of a Harry Potter book contains Voldemort content without knowing ahead of time that he may be referred to as He Who Must Not Be Named, You-Know-Who, The Dark Lord and so on. One would have to first identify the thing you're interested in, then learn when characters/the author invent new ways to refer to it, and carry all those forwards to find new instances. Fun!
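For what it's worth, the counting half is the easy part. Here is a sketch that assumes the alias list is already known, which is exactly the bit that is hard in practice; the function name, the paragraph splitting on blank lines and the sample text are all made up for this example.

```python
import re

def topic_coverage(text, aliases):
    """Return the fraction of non-empty paragraphs mentioning any alias (case-insensitive)."""
    if not aliases:
        return 0.0
    pattern = re.compile("|".join(re.escape(a) for a in aliases), re.IGNORECASE)
    # Treat blank lines as paragraph breaks.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return 0.0
    hits = sum(1 for p in paragraphs if pattern.search(p))
    return hits / len(paragraphs)

# Assumed to be known up front; discovering these automatically is the hard problem.
ALIASES = ["Voldemort", "He Who Must Not Be Named", "You-Know-Who", "the Dark Lord"]
sample = "Harry slept.\n\nYou-Know-Who was back.\n\nThe Dark Lord waited.\n\nRon ate."
print(topic_coverage(sample, ALIASES))
```

A story where the topic shows up in one paragraph out of hundreds would score close to zero here, which is the filter the quoted comment is asking for.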
It's also hard to write custom inference/tagging rules, like in the case you mentioned w.r.t. Voldemort, if you don't know what the tokens look like.
The term peer review was virtually non-existent prior to the 1960s. And despite that, nearly everything in modern society can ultimately be attributed to breakthroughs that happened prior to the advent of peer review.
https://books.google.com/ngrams/graph?content=peer+review&ye...
All of this does seem to be extremely excessive to choose a book genre though. I would imagine the number of books after a simplistic clustering technique would be rather small to flip through, so I really don't understand the use case at all.
If you have very few books (a few thousand) then you can apply more fine-grained analyses in reasonable amounts of computation, such as contextualized embedding methods. But if the point is to select a book, there is no real benefit, since the simple 2 second term frequency methods would narrow choices down to only a few books.
If you have billions of books, contextualized embeddings become quite expensive to produce and use (several weeks or months of processing, petabytes of storage, etc.), so it's not really feasible as an individual. But the extra querying capability does help narrow the large set down.
It does correlate perfectly with the period modern scholars point to as when the institutions were captured.