Activity 3: Tokenization, Stemming, and Lemmatizing Reversal

Here are several paragraphs from Casino Royale by Ian Fleming.

Bond looked again over his shoulder. The patron was discussing the menu with the new customer. It was a perfectly normal scene. They exchanged smiles over some item on the menu and apparently agreed that it would suit, for the patron took the card and with, Bond guessed, a final exchange about the wine, withdrew.

The man seemed to realize that he was being watched. He looked up and gazed incuriously at them for a moment. Then he reached for a brief-case on the chair beside him, extracted a newspaper, and started to read it, his elbows propped up on the table.

When the man had turned his face towards them, Bond noticed that he had a black patch over one eye. It was not tied with a tape across the eye, but screwed in like a monocle. Otherwise he seemed a friendly middle-aged man, with dark brown hair brushed straight back and, as Bond had seen while he was talking to the patron, particularly large, white teeth.

He turned back to Vesper. ‘Really, darling. He looks very innocent. Are you sure he’s the same man? We can’t expect to have this place entirely to ourselves.’

Vesper’s face was still a white mask. She was clutching the edge of the table with both hands. He thought she was going to faint and almost rose to come round to her, but she made a gesture to stop him. Then she reached for a glass of wine and took a deep draught. The glass rattled on her teeth, and she brought up her other hand to help. Then she put the glass down.

Use this tool:

Javascript Porter Stemmer Online

Stem the text and answer these questions:

  1. Find TWO examples of consistent stemming, meaning two or more words that appear to be stemmed using the same logic. Is there some logical pattern that you can determine?

  2. Based on your answer to #1, can you think of a word that doesn’t appear in the original text but would still match in a search? (Hint: you’re looking for a word that stems to a token which appears in the output, using the same logic you identified in #1.)

  3. What happened to punctuation and whitespace? Do you think this is appropriate? Can you think of a situation in searching where punctuation and whitespace might make a difference in the query logic?

  4. What happened to capitalization? Do you think this is appropriate? Can you think of a situation in searching where capitalization might make a difference in the query logic?

Note: I understand that many of you don’t use English as your first language. I will be forgiving on specifics.
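
If you would like to experiment offline, here is a minimal Python sketch of the kind of processing the tool performs. It is NOT the Porter algorithm: it lowercases the text, keeps only runs of letters, and applies a handful of suffix rules loosely modelled on the first Porter steps. The names tokenize and toy_stem are made up for this illustration, and the results will differ from the online tool for many words, so use the tool's output for your answers.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only runs of the letters a-z.

    Punctuation, digits, and whitespace are all discarded.
    """
    return re.findall(r"[a-z]+", text.lower())


def has_vowel(s):
    """Return True if s contains at least one of a, e, i, o, u."""
    return any(ch in "aeiou" for ch in s)


def toy_stem(word):
    """Strip a few common suffixes (a simplification, not real Porter)."""
    # Leave very short words alone so that "s" or "is" are not emptied out.
    if len(word) <= 2:
        return word

    # Plural-style endings.
    if word.endswith("sses"):
        word = word[:-2]          # "glasses" would lose "es" here
    elif word.endswith("ies"):
        word = word[:-2]          # drop "es", keep the "i"
    elif word.endswith("ss"):
        pass                      # "glass" is left unchanged
    elif word.endswith("s"):
        word = word[:-1]          # drop a plain final "s"

    # -ing / -ed endings, removed only if a vowel remains in the stem.
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and has_vowel(stem):
            word = stem
            break

    return word


if __name__ == "__main__":
    sample = "Bond looked again over his shoulder. He looks very innocent."
    for token in tokenize(sample):
        print(token, "->", toy_stem(token))
```

Run it on a sentence or two from the passage and compare each token with what the online stemmer gives you. Where the two disagree, the full Porter algorithm is applying rules that this sketch leaves out.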