David Eisinger


Spellcheck Your Hugo Site With CSpell

Posted 2024-11-20 under #meta

I edit these posts pretty carefully before publishing, but I inevitably find a misspelling or two after the fact. In the spirit of continuous improvement, I decided to see what kind of automated solutions are out there for spellchecking Markdown files, and found CSpell. It works well, but its default configuration flagged a ton of false positives that I had to scroll past to find the actual errors.

Fortunately, it’s quite configurable, and I’ve gotten it to where it only flags actual misspelled words. Here’s how.

1. Install CSpell

Assuming a modern version of Node.js (≥18), you can use npx to download and run CSpell in a single command:

npx cspell "content/**/*.md"

You’ll see a ton of spelling errors – ignore them for now.

2. Add config file

Next, let’s create a basic config file. In the root of your site, put the following in .cspell.json:

{
  "$schema": "https://raw.githubusercontent.com/streetsidesoftware/cspell/main/cspell.schema.json",
  "version": "0.2",
  "dictionaries": [
    "english"
  ]
}

3. Add additional languages

My site (especially the stuff in /elsewhere that I’ve mirrored from my company’s website) has code snippets that the English dictionary doesn’t recognize. Fortunately, CSpell ships with a bunch of additional dictionaries. Adding "ruby", "golang", and "java" to the "dictionaries" array makes a bunch of misspellings go away.
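
For reference, the "dictionaries" array would then look roughly like this (a sketch using the three names mentioned above; swap in whichever languages show up in your own snippets):

```json
"dictionaries": [
  "english",
  "ruby",
  "golang",
  "java"
]
```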

4. Ignore front matter

This one may or may not apply to your site, so feel free to ignore, but I see a lot of false positives in the front matter of my posts, mostly around the lists of references. To ignore the front matter section entirely, add the following to your config file (credit to this helpful GitHub comment):

"patterns": [
  {
    "name": "front_matter",
    "pattern": "/^(-{3}|[+]{3})$(\\s|\\S)*?^\\1$/gm"
  }
],
"languageSettings": [
  {
    "languageId": "markdown",
    "ignoreRegExpList": [
      "front_matter"
    ]
  }
]

Note that you’ll no longer catch misspellings in post titles, so it might make sense to use a more targeted regular expression.

5. Ignore proper nouns

I also see a lot of proper nouns being flagged as misspellings, so I decided to just ignore any word that begins with a capital letter. Create a new entry in the "patterns" array:

{
  "name": "proper_nouns",
  "pattern": "/[\\W_][A-Z][\\S]+/g"
}

That’s any non-word character (or an underscore), followed by a capital letter, followed by one or more non-space characters. I’m sure that’s not perfect, but it’s good enough for my content. Add the new pattern to the "ignoreRegExpList":

"languageSettings": [
  {
    "languageId": "markdown",
    "ignoreRegExpList": [
      "front_matter",
      "proper_nouns"
    ]
  }
]

6. Fix spelling

Now comes the hard part: run CSpell again (npx cspell "content/**/*.md"), look at all the misspellings it finds, and fix the ones that are genuine errors. Computers can’t help us here, friend.

7. Create a custom dictionary

Now we’ll add all the unrecognized words to a custom dictionary so that CSpell will stop flagging them. First, create the list of words:

npx cspell --words-only --unique "content/**/*.md" | sort > .dictionary

Then add a new "dictionaryDefinitions" array in your config file:

"dictionaryDefinitions": [
  {
    "name": "exceptions",
    "path": ".dictionary",
    "addWords": true
  }
],

Finally, add "exceptions" to the "dictionaries" array. At this point, CSpell should find zero misspellings. To add new exceptions to the list in the future, you can run:

npx cspell --words-only --unique "content/**/*.md" >> .dictionary
sort -o .dictionary .dictionary
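
Putting steps 2 through 7 together, the full config should end up looking roughly like this. It’s assembled from the fragments above, so treat it as a sketch: key order doesn’t matter, and your dictionary list will depend on your content.

```json
{
  "$schema": "https://raw.githubusercontent.com/streetsidesoftware/cspell/main/cspell.schema.json",
  "version": "0.2",
  "dictionaries": [
    "english",
    "ruby",
    "golang",
    "java",
    "exceptions"
  ],
  "dictionaryDefinitions": [
    {
      "name": "exceptions",
      "path": ".dictionary",
      "addWords": true
    }
  ],
  "patterns": [
    {
      "name": "front_matter",
      "pattern": "/^(-{3}|[+]{3})$(\\s|\\S)*?^\\1$/gm"
    },
    {
      "name": "proper_nouns",
      "pattern": "/[\\W_][A-Z][\\S]+/g"
    }
  ],
  "languageSettings": [
    {
      "languageId": "markdown",
      "ignoreRegExpList": [
        "front_matter",
        "proper_nouns"
      ]
    }
  ]
}
```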

8. Add to build pipeline

With all this stuff set up, it’s dead simple to add spellchecking to the build pipeline to ensure you never publish misspellings. As long as your job runner has npx available, you can just run the same npx cspell "content/**/*.md" command you’ve been running locally in a build step. CSpell exits with a non-zero status when it finds a misspelling, so the step fails the build. Here’s where I do it.


Here’s the final .cspell.json config file. I’m super happy with this setup – it’s already catching misspellings in the process of writing these words. I’m reminded of a post I read a few weeks ago, about the irony of how good and simple website publishing has become for technical people, and how complex it is for the less technically-inclined. Imagine trying to accomplish this same functionality in a typical CMS – it would not work well, if it worked at all.

