
A script for cleaning text for sadface

This Bash script takes plain-text files, such as Cory Doctorow’s plaintext books, and turns them into one-sentence-per-line files that sadface likes to read.

Code

#!/bin/bash
#
# Uses args $1 $2 $3
# Usage: sed-cleaning.sh inputfile temporaryfile outputfile

# Replace carriage returns and newlines with spaces so the text becomes one long line
tr "\r\n" " " < "$1" > "$2" # All hail http://www.unix.com/41858-post5.html

# Protect Mr. and Mrs., tidy the spacing, split after each period, then restore the titles

sed -e 's/Mr\. /Mr /g
s/Mrs\. /Mrs /g
s/\x0A/\n/g
s/\n\n/\n/g
s/\.\n/\. /g
s/\n/ /g
s/\n /\n/g
s/\n /\n/g
s/\n /\n/g
s/\n /\n/g
s/\.  /\. /g
s/\. \. \./\.\.\./g
s/ \./\./g
s/"//g
s/\. /\.\n/g
s/_//g
s/Mrs /Mrs\. /g
s/Mrs\.\. /Mrs\. /g
s/Mr /Mr\. /g
s/Mr\.\. /Mr\. /g' <"$2" >"$3"
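
For a quick check without a temporary file, the two stages can be chained in one pipeline. This is a minimal sketch, not a replacement for the script above: the function name flatten_and_split is made up for illustration, it keeps only the line-joining and sentence-splitting rules, and it assumes GNU sed, which treats \n in the replacement text as a newline.

```shell
#!/bin/bash
# Hypothetical helper: join hard-wrapped lines, then break after each sentence.
# Assumes GNU sed; other seds may not treat \n in the replacement as a newline.
flatten_and_split() {
  tr -d '\r' |      # drop carriage returns from DOS-style line endings
    tr '\n' ' ' |   # join every line into one long line
    sed -e 's/ *$//' -e 's/\. /.\n/g'   # trim trailing spaces, then split after ". "
}

# Reads stdin, writes stdout:
printf 'He chased\nthe gravy. Ever since\nthe bugouts.\n' | flatten_and_split
```

Because it reads stdin and writes stdout, it can be tried on a file with something like flatten_and_split < inputfile > outputfile.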

Example

Using the following passage from Doctorow’s A Place So Foreign and Eight More:

“Mama, I’m _not_ a super-villain,” Hershie said for the millionth time. He
chased the last of the gravy on his plate with a hunk of dark rye, skirting the
shriveled derma left behind from his kishka. Ever since the bugouts had inducted
Earth into their Galactic Federation, promising to end war, crime, and
corruption, he’d found himself at loose ends. His adoptive Earth-mother, who’d
named him Hershie Abromowicz, had talked him into meeting her at her favorite
restaurant in the heart of Toronto’s Gaza Strip.

sed-cleaning returns:

Mama, I’m not a super-villain, Hershie said for the millionth time.
He chased the last of the gravy on his plate with a hunk of dark rye, skirting the shriveled derma left behind from his kishka.
Ever since the bugouts had inducted Earth into their Galactic Federation, promising to end war, crime, and corruption, he’d found himself at loose ends.
His adoptive Earth-mother, who’d named him Hershie Abromowicz, had talked him into meeting her at her favorite restaurant in the heart of Toronto’s Gaza Strip.

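The original text was hard-wrapped with a newline at about 80 characters. Now it has a newline after every period that ends a sentence, and the periods in Mr. and Mrs. are skipped. Double quotes and underscores are stripped along the way. To handle other titles, follow the pattern used for Mr. and Mrs.: remove the title’s period before the split, then restore it afterward.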
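
As an illustration of adding another title, here is a sketch that treats Dr. the same way as Mr. and Mrs. The function name split_sentences and the choice of Dr. are assumptions made for the example, and it relies on GNU sed for \n in the replacement. Like the original rules, it will also add a period to any "Dr " or "Mr " that had none in the source text.

```shell
#!/bin/bash
# Hypothetical sketch: protect titles, split after ". ", then restore the titles.
# Dr. is the added title; Mr. and Mrs. follow the rules from the script above.
# Assumes GNU sed (\n in the replacement becomes a newline).
split_sentences() {
  sed -e 's/Mr\. /Mr /g
s/Mrs\. /Mrs /g
s/Dr\. /Dr /g
s/\. /.\n/g
s/Dr /Dr. /g
s/Mrs /Mrs. /g
s/Mr /Mr. /g'
}

printf '%s' 'Dr. Smith met Mrs. Jones. They talked.' | split_sentences
```

Each new title needs one rule before the split to drop its period and one rule after the split to put it back.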

Credits

Text from The Super Man and the Bugout used under a Creative Commons 1.0 Attribution-NoDerivatives-NonCommercial License.

linuxer_rh on the Unix and Linux Forums at Unix.com

paradigm from the Ohio State University Open Source Club IRC channel