Many people are vaguely aware that the word ‘hacker’ did not always refer to a computer burglar. Once you begin to ask about the details, however, things start to break down. When did it go from meaning computer tricksterism to meaning computer trespass? I never seem to get the same answer twice. Some cite Steven Levy’s Hackers as having ruined the term by popularizing it for a generation of teenage punks. More informed respondents tell me that the 414 Gang corrupted ‘hacker’ through their antics. In my own research, one incredible primary source I keep coming back to is a bulletin board that existed circa 1980 called 8BBS. 8BBS was an open forum that ended up being used primarily by phreakers to discuss the art of phone and computer intrusion. Given that it was one of the major primary sources for Katie Hafner’s Cyberpunk, you’d think I’d hear about it more often.
The basic reason I don’t is that it wasn’t available on the web until recently, and even now it’s only available as a nasty PDF. For my own personal use this is highly inconvenient, but for getting others interested it’s basically a showstopper. Realizing that nobody else is going to fix it, and that this is a valuable thing that deserves to be on the open web, I’d like to sketch a plan for restoring this source to a decent web-based home.
The Problem: It’s A Scan…Of A Printout
The original medium this source came in had search and good indexing. It was easy to read a thread or find posts by a certain user. The printout format strips away search, and mangles the indexing a bit, since page number and post ID are correlated but not quite the same. Worse, the printout was hole-punched at some point so it could go into a binder, and this destroyed some of the information. The scan adds a further layer of obfuscation: scanning magnifies the flaws already present in the printed pages, and the subtle details lost in the process make the resulting document highly unpleasant to read.
The archive.org material includes a 500 MB archive of individual JP2 images of the scanned pages. These could easily be imported into a wiki, at which point it would be fairly easy to tag which post IDs appear on each page. That way you’d be able to browse the posts by ID number, even if it just means seeing the post as an image.
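To make the tagging idea concrete, here is a rough Python sketch of the kind of index I have in mind. Everything in it is an assumption for illustration: the page filenames, the post ID numbers, and the function name build_post_index are all made up, and a real wiki would store the tags in its own way. The point is only that once each page image is tagged with the post IDs visible on it, inverting that mapping gives you browsing by post ID, including posts that spill across a page break.

```python
from collections import defaultdict

# Hypothetical tagging data: scanned page file -> post IDs visible on that page.
# Filenames and ID numbers are invented for illustration only.
page_tags = {
    "page_0001.jp2": [101, 102],
    "page_0002.jp2": [102, 103, 104],
    "page_0003.jp2": [104],
}

def build_post_index(tags):
    """Invert page -> post IDs into post ID -> list of pages, in page order."""
    index = defaultdict(list)
    for page in sorted(tags):
        for post_id in tags[page]:
            # Guard against the same ID being tagged twice on one page.
            if page not in index[post_id]:
                index[post_id].append(page)
    return dict(index)

post_index = build_post_index(page_tags)

# Post 102 starts on one page and continues on the next.
print(post_index[102])
```

Looking up a post ID then returns every page image it appears on, which is enough to browse by post even before any text has been transcribed.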
Restoring Readability and Search
To improve readability through CSS, the posts have to exist as actual text rather than as pictures of text. This means the text must be extracted from the images somehow.
Text Extraction Methods
OCR - I haven’t really been able to get OCR to work very well. So far the packages I’ve used have been of such low quality that it was faster to type. The closest I’ve come to usable results was with this tool, which is specifically designed to recognize fixed-width fonts.
Human Transcription - This is what I’ve had the most success with so far, but it’s labor-intensive and slow. One hope is that by doing things on a wiki platform I can enlist the assistance of others in transcribing images.
Mechanical Turk - A subpoint of the above: it occurs to me that I could pay an online service like Mechanical Turk to transcribe the pages. I’m not sure how much I’d have to pay to get something accurate, but perhaps the fact that the participant is making history might allow me to get services at a lower rate. (So as not to sound callous: I’m given to understand that Turkers are sensitive to that kind of thing, since the wages are low enough that many participants are there to have something to do at their job during breaks. For a ‘normal’ transcription service I would assume it’s just business.)
Once the images are transcribed, search is just a matter of getting the text into a platform with decent search. PmWiki has good search, so I’m not nearly as worried about this aspect as I am about being able to get the posts into text at all.
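For a sense of how little is needed once the text exists, here is a minimal Python sketch of keyword search over transcribed posts. This is not how PmWiki's search works; it is a stand-in I'd use to sanity-check transcriptions before they go into the wiki. The sample posts, the ID numbers, and the names tokenize, build_index, and search are all invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical transcribed posts keyed by post ID; the text is invented.
posts = {
    101: "Anyone know the dial-up number for the new system?",
    102: "Try the number posted last week. The system answers at 300 baud.",
    103: "Looking for manuals on the phone switch.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(posts):
    """Map each word to the set of post IDs whose text contains it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in tokenize(text):
            index[word].add(post_id)
    return index

def search(index, query):
    """Return sorted IDs of posts containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

index = build_index(posts)
print(search(index, "number system"))
```

A query matches a post only if every query word appears in it, so searching for a word that was mistyped during transcription returns nothing, which is itself a cheap way to spot transcription errors.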