A SANE Paper Archival Pipeline

Problem Statement

I have a problem with paper. I’ve got too much stuff on it. I know I need very little of it, but I (irrationally) fear that the moment I chuck or shred any of it, I’ll need it. So clearly, I needed a way to scan, archive, and index stuff so I could safely shred it.

TL;DR: My solution is just scanservjs + Paperless-ngx… but let me explain how I got there.

Existing Bits

As an avid home-labber / self-hoster, I already had Paperless-ngx on my radar as where documents should live, but I had less of an idea as to how to scan the documents in the first place. I’ve had Canon Pixmas since they were first introduced, and have had a Pixma MX410 in service for ages, though I’ve only really used the scanning portion of its multi-function-ness in recent years, because who needs to print stuff in color?

It’s worked OK as a dedicated scanner: it’s got a platen and a nice ADF and the ability to scan straight to PDF, but I haven’t been able to scan directly to my macOS devices for years, and have been stuck scanning to a thumb drive. That was a pain, but what really pushed me over the edge was when it stopped scanning to USB devices, which basically left me with e-waste.

I first looked at what the Paperless community recommends with regard to hardware, and it’s a ton of $200-300 document scanners (no platen!) that can write to network shares. I want to solve this problem, but would rather do basically anything else with that money than “buy a scanner”, so off I went to find other solutions.

SANE Solutions

The reason I couldn’t scan to macOS was lack of software that would even install on modern macOS, but what about scanning straight to Linux? Canon allegedly offers a Debian package of drivers, but before going down that rabbit hole I asked the internet "howto scan on linux" and found the SANE Project. So I installed it, plugged the scanner into my server, and ran scanimage -L to see what it could find:

❯ scanimage -L
device 'pixma:MX410_###########' is a CANON Canon PIXMA MX410 multi-function peripheral
device 'epson2:net:###.###.###.###' is a Epson PID flatbed scanner
device 'airscan:w4:Canon MX410 series _###########' is a WSD Canon MX410 series _########### ip=###.###.###.###
device 'airscan:e3:EPSON ET-2800 Series' is a eSCL EPSON ET-2800 Series ip=###.###.###.###

It turns out SANE is so good that it informed me that I had OTHER SCANNERS on my network that I was unaware of. A year or so back Wife bought a cheap Epson printer that could be converted into a dye-sublimation printer. Turns out it has a scanner. And yeah, turns out I didn’t need to connect over USB, because it found the MX410 over the network, completely automagically. Poking around scanimage -A showed that SANE detected all of its features, including the ADF, and I just had to figure out how to actually scan stuff… which of course isn’t easy.
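For the curious, here’s a rough sketch of how you could pull just the device names out of that scanimage -L listing so they can be handed back to scanimage -d. The function name list_sane_devices is made up for illustration, and the pattern assumes the output keeps the "device '<name>' is a ..." shape shown above. Some scanimage builds open the name with a backtick instead of an apostrophe, so the pattern accepts either.

```sh
#!/bin/sh
# list_sane_devices: hypothetical helper, not part of SANE.
# Reads `scanimage -L` output on stdin and prints one device name per line.
list_sane_devices() {
    # Matches lines shaped like: device '<name>' is a <description>
    # The opening quote may be a backtick or an apostrophe,
    # and <name> may contain spaces (the airscan entries do).
    sed -n "s/^device [\`']\(.*\)' is a .*/\1/p"
}

# Example (needs SANE installed and a reachable scanner):
#   scanimage -L | list_sane_devices
#   scanimage -d 'pixma:MX410_###########' -A   # show every option that device supports
```

Each name it prints is exactly what scanimage -d expects, so it’s an easy way to check what a particular scanner supports before committing to a frontend.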

Configuring Frontends

The scanner has buttons on the front and a concept of “scanning to a network device”, but that doesn’t work with SANE. SANE can poll device button statuses, though, so I started looking at scanbd as recommended by various articles. Constantly polling a device I’m going to use maybe a couple of times a month seemed like severe overkill; I just needed a frontend... so, duh, I went to the SANE Frontends page and found scanservjs, which offers a helpfully pre-built Docker container that has SANE baked into it, so you don’t even need the SANE tools installed on your host.

So I chucked the following in my docker-compose.yaml:

  paperless-ngx:
    image: lscr.io/linuxserver/paperless-ngx:latest
    container_name: paperless-ngx
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/New_York
    volumes:
      - ~/docker/paperless/config:/config
      - ~/docker/paperless/data:/data
    ports:
      - 8000:8000
    restart: unless-stopped

  scanservjs:
    image: sbs20/scanservjs:latest
    container_name: scanservjs
    volumes:
      - /var/run/dbus:/var/run/dbus # needed for airscan
      - ~/docker/scanservjs:/app/config
    ports:
      - 8001:8080
    restart: unless-stopped

That was all it took to get both services up and running. I did some tests in scanservjs and confirmed that yes, it scans and does everything it says on the box. Then I just had to glue them together by adding one more volume to the scanservjs container:

      - ~/docker/paperless/data/consume:/app/data/output

That rigs scanservjs’s output folder directly to Paperless’ ingestion folder.

Actually Using It

Using it is pretty easy. If I have something to scan I walk over to the scanner (or, technically, a scanner because it works on both of them) with an iPad or whatever. I load up the scanservjs UI at :8001 and tell it:

Which scanner?
Which input location? (ADF or Platen)
Size of the paper (Letter, A4, Legal, whatever)
Quality (DPI, Color / Grayscale)
Batch or Not (Are we scanning multiple pages from the ADF?)
Output Format (Image or PDF, which only matters for multi-page documents)

Oh, and if I’m scanning something weirdly-shaped (like an ID or whatever), I can even do a scan preview and crop it right there.

Hit scan, it scans.

Wait a minute or so and flip over to Paperless-ngx at :8000 and confirm that it’s there. Paperless will automatically convert everything it ingests into indexed PDFs, and will preserve the raw files too.
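If you’d rather not sit there refreshing the page, here’s a minimal sketch that waits for the consume folder to drain. It assumes Paperless-ngx removes each file from its consume folder once it has been ingested (that’s my understanding of its default behavior), and that the folder is the host path from the compose file above. The name wait_for_consume is made up for illustration.

```sh
#!/bin/sh
# wait_for_consume: hypothetical helper, not part of Paperless-ngx or scanservjs.
# Usage: wait_for_consume DIR [TIMEOUT_SECONDS]
# Returns 0 once DIR is empty, 1 if files remain after the timeout,
# and 2 if DIR does not exist.
# Note: plain POSIX sh has no `local`, so dir/timeout/waited are global.
wait_for_consume() {
    dir=$1
    timeout=${2:-120}   # assumption: two minutes is plenty for a small scan
    waited=0

    [ -d "$dir" ] || return 2

    # `ls -A` prints nothing for an empty directory.
    while [ -n "$(ls -A "$dir" 2>/dev/null)" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example:
#   wait_for_consume ~/docker/paperless/data/consume && echo "Paperless has eaten everything"
```

A non-zero exit after the timeout usually means something is stuck in the folder, which is a good cue to go look at the Paperless logs.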

Then when I’m done with my batch of scans, I can head over to a Real Computer and rename the files in Paperless, add some metadata & tags… and then promptly forget about it and never access it because I probably could have just thrown it out in the first place.