Archive for May 2012

Sadface: now on GitHub

That’s all there is to say, really.

For more on sadface, read the readme.

A script for cleaning text for sadface

This Bash script takes plain-text files, such as Cory Doctorow’s plaintext books, and turns them into the one-sentence-per-line files that sadface likes to read.

Code

#!/bin/bash
#
# Uses args $1 $2 $3
# Usage: sed-cleaning.sh inputfile temporaryfile outputfile

# Join the hard-wrapped input into one long line by turning every
# carriage return and newline into a space.
tr "\r\n" " " < "$1" > "$2" # All hail http://www.unix.com/41858-post5.html

# The sed commands below read the temporary file and write the output file.

sed -e 's/Mr\. /Mr /g
s/Mrs\. /Mrs /g
s/\x0A/\n/g
s/\n\n/\n/g
s/\.\n/\. /g
s/\n/ /g
s/\n /\n/g
s/\n /\n/g
s/\n /\n/g
s/\n /\n/g
s/\. /\. /g
s/\. \. \./\.\.\./g
s/ \./\./g
s/"//g
s/\. /\.\n/g
s/_//g
s/Mrs /Mrs\. /g
s/Mrs\.\. /Mrs\. /g
s/Mr /Mr\. /g
s/Mr\.\. /Mr\. /g' < "$2" > "$3"

Example

Using the following passage from Doctorow’s A Place So Foreign and Eight More:

“Mama, I’m _not_ a super-villain,” Hershie said for the millionth time. He
chased the last of the gravy on his plate with a hunk of dark rye, skirting the
shriveled derma left behind from his kishka. Ever since the bugouts had inducted
Earth into their Galactic Federation, promising to end war, crime, and
corruption, he’d found himself at loose ends. His adoptive Earth-mother, who’d
named him Hershie Abromowicz, had talked him into meeting her at her favorite
restaurant in the heart of Toronto’s Gaza Strip.

sed-cleaning returns:

Mama, I’m not a super-villain, Hershie said for the millionth time.
He chased the last of the gravy on his plate with a hunk of dark rye, skirting the shriveled derma left behind from his kishka.
Ever since the bugouts had inducted Earth into their Galactic Federation, promising to end war, crime, and corruption, he’d found himself at loose ends.
His adoptive Earth-mother, who’d named him Hershie Abromowicz, had talked him into meeting her at her favorite restaurant in the heart of Toronto’s Gaza Strip.

The original text was hard-wrapped with a newline roughly every 80 characters. Now it’s wrapped with a newline after every period that is followed by a space, except the periods in Mr. and Mrs. If you want other titles to be detected, follow the pattern used for Mr. and Mrs.: strip the period before the split and restore it afterwards.
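The protect, split, restore idea can also be sketched in Python, the language sadface is written in. This is not the sed script above, only a rough equivalent; the function name split_sentences and the TITLES list (including the extra "Dr") are made up for illustration.

```python
import re

# Hypothetical list of titles to protect; extend it as needed.
TITLES = ["Mr", "Mrs", "Dr"]

def split_sentences(text, titles=TITLES):
    # Join hard-wrapped lines into one long line, roughly what the tr step does.
    text = " ".join(text.split())
    # Drop straight double quotes and underscores, as the sed script does.
    text = text.replace('"', "").replace("_", "")
    # Hide the period after each title so it is not taken for a sentence end.
    for title in titles:
        text = re.sub(r"\b%s\. " % re.escape(title), title + " ", text)
    # Break after every remaining period that is followed by a space.
    text = text.replace(". ", ".\n")
    # Put the periods back on the titles.
    for title in titles:
        text = re.sub(r"\b%s " % re.escape(title), title + ". ", text)
    return text.split("\n")
```

Like the sed version, this adds a period to any bare "Mr " or "Mrs " that never had one, so it suits book text better than chat logs.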

Credits

Text from The Super Man and the Bugout used under a Creative Commons 1.0 Attribution-NoDerivatives-NonCommercial License.

linuxer_rh on the Unix and Linux Forums at Unix.com

paradigm from the Ohio State University Open Source Club IRC channel

sadface-bot: A Markov chain bot

Markov bots make for amusing text generators. They don’t make much sense, usually. When they do make sense, it’s pure chance.

sadface draws its vocabulary and concepts from a flat text file, where each line is considered a sentence. The bot chains words together to create sentences, which it passes to the IRC channel it is in.
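For readers new to Markov bots, here is a minimal sketch of the chaining idea. It is not sadface’s actual code: it assumes a chain that looks back only one word, and the START and END markers and both function names are invented for the example.

```python
import random
from collections import defaultdict

# Hypothetical markers for the start and end of a sentence.
START, END = "<start>", "<end>"

def build_chain(lines):
    """Map each word to the list of words seen directly after it."""
    chain = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        words = [START] + line.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain, rng=random, max_words=30):
    """Walk the chain from START until END or the word limit is reached."""
    if START not in chain:
        return ""  # nothing has been learned yet
    word = START
    out = []
    while len(out) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)
```

Each line of the brain file would be fed to build_chain as one sentence; generate then picks a random follower at every step, which is why the results only occasionally make sense.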

Right now, sadface only supports one channel, but you can have multiple instances of sadface running with different configuration files. The configuration file is specified at runtime as an argument: python sadface-configgable.py config-file.ini

Included in sadface-bot.zip are sadface-configgable.py and default.ini. If you want to change the settings, I encourage you to copy default.ini and edit the copy, so you always have an untouched default.ini to fall back on.
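As a rough illustration of how an .ini file like this can be read, here is a sketch using Python 3’s standard configparser module. The section and option names below are guesses made up for the example; check default.ini for the real ones.

```python
import configparser  # this module is called ConfigParser on Python 2

# Hypothetical file contents; the real default.ini may use other names.
SAMPLE_INI = """
[irc]
server = irc.foonetic.net
channel = #sadface

[brain]
brain_file = brain_file.txt
"""

def load_settings(ini_text):
    """Parse INI text and return the values the bot would need."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        "server": parser.get("irc", "server"),
        "channel": parser.get("irc", "channel"),
        "brain_file": parser.get("brain", "brain_file"),
    }

settings = load_settings(SAMPLE_INI)
```

To read a file named on the command line instead, parser.read(sys.argv[1]) would take the place of read_string.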

You can start sadface with a blank brain_file.txt, but its replies won’t make much sense at all until it’s heard a lot of conversation. I recommend putting several books into the file. Project Gutenberg is a good place to start. Separate sentences by newlines. Replies look best if there are no quotes or tabs in brain_file.txt. You can point sadface at a different brain file in your configuration file.
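If a brain file already has one sentence per line, a few lines of Python can tidy it up. This is only a sketch of the advice above, not part of sadface; the function name clean_brain_lines is made up.

```python
def clean_brain_lines(lines):
    """Return lines with quotes and tabs removed and blank lines dropped."""
    cleaned = []
    for line in lines:
        # Tabs become spaces; straight double quotes are dropped.
        line = line.replace("\t", " ").replace('"', "")
        # Collapse runs of whitespace and trim both ends.
        line = " ".join(line.split())
        if line:
            cleaned.append(line)
    return cleaned
```

It could be run over the lines of a file with something like clean_brain_lines(open("brain_file.txt")), writing the result back out one sentence per line.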

Interact

To play with sadface, /join #sadface on irc.foonetic.net, or supply your own IRC server, channel and brain_file.txt.

Download

sadface depends on:

  • Python 2.7.3, available in repositories or at Python.org
  • python-twisted, available in the Ubuntu, Debian and openSUSE repositories or from source at the Twisted downloads page. Installers are available for Windows and Mac.

Download sadface-bot.zip

Credits

sadface is heavily based on Eric Florenzano’s MomBot, which uses the Twisted networking library to handle IRC.

sadface uses configuration methods written by hhokanson for AnonBot, an IRC channel anonymizer.