Generating Automatic Bookmarks in Text
The New York Times has implemented a system called ‘Emphasis’ that generates automatic bookmarks for every paragraph in a long block of text. The system builds a six-letter key from the first letters of the first three words of the first and last sentences of the paragraph. The hard part is identifying paragraphs consistently, even after editors change or rearrange them.
Generated by Azure AI on June 24, 2024
Recently, I sent someone a link to a long Gadgetopia post, and I wanted them to read one particular paragraph. So, I had to tell them, scroll down about halfway to the paragraph that starts… It was annoying. What I needed was automatic, stable bookmarks applied to every paragraph in that long block of text.
The New York Times did something like this with a system they called Emphasis. They released version 2 of it some time ago, and it works really well once you know how to use it. For instance, here’s a link that highlights a specific sentence in the first paragraph of an article (but, weirdly, it doesn’t scroll you down below the fold).
The biggest trick here, and the thing that will derail most attempts to do this, is being able to identify paragraphs consistently, even when editors start rearranging things. Text elements are not stable like managed content or database records. In fact, paragraphs of text within a larger page are fairly volatile: people edit them and move them around, so getting a stable key can be tricky.
The Times did something fairly impressive here. They generated their key using the first letters of the first three words of the first and last sentences of the paragraph. So, each paragraph would end up with a six-letter key. For this paragraph, it would be: “TTdFtp”
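To make the key idea concrete, here is a rough Python sketch of it. This is not the Times' code (Emphasis itself is JavaScript); the function names `split_sentences` and `paragraph_key` are made up for illustration, and the sentence splitter is a deliberately naive assumption that will trip on abbreviations like "e.g." or "Dr.".

```python
import re


def split_sentences(paragraph: str) -> list[str]:
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def initials(sentence: str, count: int = 3) -> str:
    """First letter (or digit) of each of the first `count` words."""
    letters = []
    for word in sentence.split():
        match = re.search(r"\w", word)  # skip leading quotes and brackets
        if match:
            letters.append(match.group())
        if len(letters) == count:
            break
    return "".join(letters)


def paragraph_key(paragraph: str) -> str:
    """Initials of the first sentence followed by initials of the last sentence."""
    sentences = split_sentences(paragraph)
    if not sentences:
        return ""
    return initials(sentences[0]) + initials(sentences[-1])


sample = (
    "The Times did something fairly impressive here. "
    "They generated their key using the first letters of the first three words "
    "of the first and last sentences of the paragraph. "
    "So, each paragraph would end up with a six-letter key. "
    "For this paragraph, it would be: “TTdFtp”"
)
print(paragraph_key(sample))  # TTdFtp
```

Note that a one-sentence paragraph uses the same sentence twice, so its key is just the same three letters repeated, and short paragraphs are more likely to collide with each other.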
When someone comes in with that key as a bookmark, the script tries to match it exactly. If it can’t (because the article changed), it tries to match half of the key, hoping that only one of the sentences was changed. If that fails too, it uses some form of the Levenshtein distance to find a paragraph with similar-enough text that it might be the missing paragraph.
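The three-step fallback could be sketched like this in Python. Again, this is my own illustration and not how Emphasis is actually written: `find_paragraph` and `max_distance` are hypothetical names, and I'm assuming the fuzzy step compares the bookmark against each paragraph's key (the bookmark is all the visitor brings with them), with an arbitrary cutoff of two edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def find_paragraph(bookmark: str, keys: list[str], max_distance: int = 2):
    """Return the index of the best-matching paragraph key, or None."""
    if not bookmark:
        return None

    # 1. Exact match: the paragraph's first and last sentences are unchanged.
    for index, key in enumerate(keys):
        if key == bookmark:
            return index

    # 2. Half match: only one of the two sentences was edited.
    half = len(bookmark) // 2
    head, tail = bookmark[:half], bookmark[half:]
    for index, key in enumerate(keys):
        if key[:half] == head or key[half:] == tail:
            return index

    # 3. Fuzzy match: closest key within max_distance edits (assumed cutoff).
    best_index, best_distance = None, max_distance + 1
    for index, key in enumerate(keys):
        distance = levenshtein(bookmark, key)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


keys = ["TTdFtp", "WscIta", "HaeIbi"]
print(find_paragraph("TTdFtp", keys))  # 0 (exact)
print(find_paragraph("WscXyz", keys))  # 1 (first half still matches)
print(find_paragraph("ZZZZZZ", keys))  # None (nothing close enough)
```

The half-match step returns the first hit, which is a simplification: on a long page, two paragraphs can share a three-letter half, so a real implementation would need some way to break ties.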
(The code for Emphasis is on GitHub.)
I was reminded of this while reading a blog post series by someone trying to do something similar with Episerver. In this case, they’re trying to allow commenting on intra-text elements, and they’re going a route using GUIDs and changes to TinyMCE to keep those GUIDs stable during editing.
And, just the other day, I saw the same thing with the Django Docs. They generate a bookmark for every heading, which lets people deep link into the text. However, there doesn’t appear to be a matching algorithm: the bookmark is just the text of the heading, lower-cased and dash-delimited. This will work in more cases than not, but it’s fairly brittle.
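For comparison, the heading approach boils down to something like the following Python sketch. This is only my approximation of "lower-cased and dash-delimited", not the actual code behind the Django Docs, and `heading_anchor` is a made-up name.

```python
import re


def heading_anchor(heading: str) -> str:
    """Lower-case the heading, drop punctuation, join words with dashes."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\s-]", "", slug)   # remove punctuation
    slug = re.sub(r"[\s_-]+", "-", slug)   # runs of whitespace/underscores/dashes -> one dash
    return slug.strip("-")


print(heading_anchor("Making queries"))        # making-queries
print(heading_anchor("Making queries, fast"))  # making-queries-fast
```

The second call shows why this is brittle: any edit to the heading text produces a different anchor, and with no fallback matching, every existing link to the old one breaks.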
Has anyone else seen other attempts at this? I’d be interested in other ways to do it.