Taming Text: How to Find, Organize, and Manipulate It
The book alternates between a great overview of the subject and getting down and dirty with the code. I don’t know a good solution to this, but I was less interested in the code than in the high-level view.
It’s a good discussion of how to manage text: how to tokenize it, search it, cluster it, and classify it. In the later chapters, it bogs way, way down – Chapter 7, on classification, is probably a quarter of the entire book in length. In many cases they “dropped to code” immediately without discussing the theory much.
In the end, I got what I came for – I understand the basic concepts of text processing. I now know what I don’t know, and that’s often half the battle.