weblog.masukomi.org

mah-soo-koh-me

 

Caterpillar 3.0 Release May 30, 2007

Filed under: Uncategorized — masukomi @ 2:55 am

Caterpillar 3.0 is a proof of concept app to demonstrate how incredibly useful
Bayesian filtering can be when tasked with finding “interesting” articles
instead of, or in addition to, spam. Download it here (10.7 MB). It works for
me. I enjoy it, and it makes my life easier. Maybe it will make yours easier
too.



How does it work?

Well, the simple explanation is that Caterpillar watches what you read and what
you don’t read. If you click on a title, Caterpillar figures it must have been
at least somewhat “interesting” to you. If you click on a link in a post, it
figures the post’s contents were probably “interesting” too. If something sits
around in your list for so long that it eventually gets purged, Caterpillar
figures that its title probably wasn’t very “interesting” to you.



In a surprisingly short amount of time Caterpillar is able to start picking out
articles that you’ll probably find interesting. These it will highlight in
green, instead of the standard blue. That’s it. The only time you should ever
have to specifically train Caterpillar is when it gets something wrong, which
isn’t that often. So, if Caterpillar thinks something would be interesting to
you but you really disagree, just select it and choose “Downvote” from the
“Entries” menu. You can, of course, Upvote things too, but you really shouldn’t
need to.
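
For the curious, here is a minimal sketch of how this kind of “positive” Bayesian filtering can work. It is not Caterpillar’s actual code: the class name, method names, and the 0.5 highlight threshold are all made up for illustration, and it assumes a modern JDK (8 or later) rather than Java 5. Clicked entries are trained as interesting, purged ones as not, and each new title gets a probability based on the words it contains.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical naive Bayes scorer; an illustration, not Caterpillar's source. */
class InterestFilter {
    // Number of trained entries of each kind that contained a given word.
    private final Map<String, Integer> interestingCounts = new HashMap<>();
    private final Map<String, Integer> boringCounts = new HashMap<>();
    private int interestingEntries = 0;
    private int boringEntries = 0;

    /** Train on one entry: true if it was clicked, false if it was purged unread. */
    void train(String text, boolean wasInteresting) {
        Map<String, Integer> counts = wasInteresting ? interestingCounts : boringCounts;
        for (String word : words(text)) {
            counts.merge(word, 1, Integer::sum);
        }
        if (wasInteresting) interestingEntries++; else boringEntries++;
    }

    /** Estimated probability (between 0 and 1) that the text is interesting. */
    double score(String text) {
        double total = interestingEntries + boringEntries;
        // Work in log space; add-one smoothing keeps unseen words from zeroing things out.
        double logInteresting = Math.log((interestingEntries + 1.0) / (total + 2.0));
        double logBoring = Math.log((boringEntries + 1.0) / (total + 2.0));
        for (String word : words(text)) {
            logInteresting += Math.log(
                (interestingCounts.getOrDefault(word, 0) + 1.0) / (interestingEntries + 2.0));
            logBoring += Math.log(
                (boringCounts.getOrDefault(word, 0) + 1.0) / (boringEntries + 2.0));
        }
        return 1.0 / (1.0 + Math.exp(logBoring - logInteresting));
    }

    /** Lower-cased set of distinct words, so each word counts once per entry. */
    private static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) result.add(w);
        }
        return result;
    }

    public static void main(String[] args) {
        InterestFilter filter = new InterestFilter();
        filter.train("Bayesian filtering in Java", true);   // title was clicked
        filter.train("Celebrity gossip roundup", false);    // purged unread
        String title = "More Java filtering tricks";
        // Assumed rule for this sketch: highlight anything scoring above 0.5.
        System.out.println(title + " -> " + (filter.score(title) > 0.5 ? "green" : "blue"));
    }
}
```

An explicit Downvote or Upvote would simply be another call to `train` with the appropriate flag.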



Who should use it?

Caterpillar is good for two primary groups of people.

  1. People with way too many subscriptions and not enough time or energy to sift
    through them all for the posts that are actually worth reading.
  2. People who want a simple user interface for their feed reader.

And, of course, randomly curious geeks. Regardless of why you’ve chosen to try
Caterpillar, it’s important that you remember that it is just a proof of concept
and as such it may have some rough edges.



What about screenshots?

Visually it hasn’t changed much since the 2.0 release and there are plenty of
screenshots of that on the Caterpillar 2.0 site. There’s now the highlighting of
“interesting” items in green, a few more menu options, and a couple extra useful
links and info when reading an article. See “Why Release it Now?” for why I
don’t have updated screenshots.



Requirements & Use

It requires Java 5 or higher, although if you have some pressing need, and
you’re a geek, you could compile it under Java 1.4.x and it should work.

To run it, just double click on the Caterpillar.jar file in your desktop’s file
manager and all should work. OS X users should be able to just double click on
the pretty icon.

You can also use the command line. Change to the Caterpillar directory and type:
java -jar Caterpillar.jar



Give it a week.

If you read a crazy number of feeds, like I do, you’ll probably see Caterpillar
start picking out new entries for you by the end of your first day. If you’re
like most people it may take a bit longer. The more you use it, the faster it
learns. It’s that simple. Just don’t click on random entries in hopes that it
will help. It won’t. Just read the entries you would normally read and let it
learn what’s really interesting to you.



How do I import feeds from my current aggregator?


Most feed aggregators will allow you to export your list of feeds as an OPML
file. Rename that file exportedFeeds.opml and place it in the same directory as
the Caterpillar.jar file. Then choose “Import Feeds” from the File menu and give
it a while to go and download them. If you started it from the command line
you’ll see it calling out the names of the feeds it’s importing as it goes.
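
If you want to check what is in an exported file before importing it, a few lines of Java will list the feed URLs. This is only a sketch and not how Caterpillar’s importer is written: the class and method names are hypothetical, it assumes a modern JDK, and it assumes the export follows the usual OPML convention of storing each feed’s address in an `xmlUrl` attribute on an `<outline>` element.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper that lists the feed URLs found in an OPML export. */
class OpmlFeedLister {

    /** Collects the xmlUrl attribute of every outline element that has one. */
    static List<String> feedUrls(InputStream opml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(opml);
        // Matches nested outlines too, so feeds grouped in folders are included.
        NodeList outlines = doc.getElementsByTagName("outline");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < outlines.getLength(); i++) {
            Element outline = (Element) outlines.item(i);
            String url = outline.getAttribute("xmlUrl"); // empty string when absent
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the file name the import expects; pass another path to override.
        File file = new File(args.length > 0 ? args[0] : "exportedFeeds.opml");
        if (!file.exists()) {
            System.err.println("No such file: " + file);
            return;
        }
        try (InputStream in = new FileInputStream(file)) {
            for (String url : feedUrls(in)) {
                System.out.println(url);
            }
        }
    }
}
```

Folder-style outlines without an `xmlUrl` are skipped, so only real feeds are printed.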



What about a manual?

The Caterpillar 2.0 site has the Caterpillar 2.0 docs which cover basically
everything except the Bayesian learning stuff which I just covered in the “How
does it work?” section above. See “Why Release it Now?” for why I don’t have an
updated manual.



So what do I mean by “proof of concept”?

Well, Caterpillar’s a good app. I use it every day. But it’s got some
limitations, and right now I just don’t have the time to fix them. So here they
are in no particular order:

  • Right now Caterpillar’s XML parsing is too strict. It only tolerates
    well-formed XML but, unfortunately, there are a lot of blog posts out there
    that don’t have their contents properly encoded. If it encounters a poorly
    formed feed you’ll get an error dialog that says “There was a problem
    parsing the xml for …..” with some details. It was written before good feed
    parsing libraries like Rome existed.
  • Caterpillar mixes the entries from all the feeds together into one long
    list. You can use the pull down to select entries from a particular feed if
    you like, though. This is because Caterpillar was written to help me weed
    through the entries in the 300+ feeds I read. I don’t have time to read ALL
    the posts, and clicking on every one of three hundred feed titles, the way
    I’d have to in most readers, would take a really long time. Plus I don’t
    care which feed the “interesting” posts come from. I just care that they’re
    interesting. Don’t you?
  • Caterpillar doesn’t support auto-discovery. This means you have to give it
    the URL of the actual feed; you can’t just give it the web page’s URL and
    hope it can find the feed.
  • Caterpillar can take up a lot of memory.
  • It’s been over two years since I last did any real work on Caterpillar, and
    it was never intended to grow to the point that it has. From a code
    standpoint there is much room for improvement, and a number of the libraries
    could do with an update.
  • Caterpillar uses Aspirin when you want to send e-mails, which means you
    don’t have to configure anything. Unfortunately, it also means you’ll be
    sending e-mail from your box and you would be surprised at the number of
    ISPs who just block any e-mail coming from a dynamic IP address because they
    assume it must be spam. And no, Caterpillar never sends e-mails unless you
    tell it to.
  • The OPML import could obviously use some improvement, and I think the
    export still uses its old internal format that can’t be read by other apps.
  • It doesn’t whistle any more. It used to have this nice, non-annoying, train
    whistle that would sound when it found new entries. There’s even a “Make it
    whistle” menu item that just made it whistle. It no longer whistles. This
    makes me sad. :(
    Correction: It DOES whistle; it just appears that Ubuntu’s love/hate
    relationship with my sound card chose yesterday to rear its head again.
  • 10.7 MB seems a bit much for such a small app…
  • Sometimes the search functionality loses its brain and needs to be reset.
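
To make the first limitation concrete, here is a small sketch of what “too strict” means in practice. It is an illustration only, using the JDK’s standard XML parser rather than whatever Caterpillar uses internally, and the class name and sample strings are invented. A strict parser rejects a whole document over a single unescaped ampersand, which is exactly the kind of thing improperly encoded posts contain.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical demo of how a strict XML parser treats sloppy feed content. */
class StrictParseDemo {

    /** Returns true only if the whole string parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Any well-formedness error is fatal for a strict parser.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Properly encoded: the ampersand is written as an entity.
        String good = "<item><title>Tom &amp; Jerry</title></item>";
        // Poorly encoded: a raw ampersand makes the document not well-formed.
        String bad = "<item><title>Tom & Jerry</title></item>";
        System.out.println("encoded title parses:   " + isWellFormed(good));
        System.out.println("unencoded title parses: " + isWellFormed(bad));
    }
}
```

The JDK parser may also print its own error message to standard error for the second string. More forgiving feed libraries try to recover from this sort of mistake instead of giving up on the feed.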



Why release it now?

Or, more to the point, why release it in an unfinished state? Well, I had
intended to finish polishing it up and release it as a commercial product. But
that was over two years ago and there have just been too many other projects on
my plate that are more important to me. I’d rather see people get some use out
of it than have it continue to sit on my computer benefiting no one but me.
I’m also hoping that some smart programmer at Google will see the value of
positive Bayesian filtering and apply it to Google Reader and Gmail. Just
imagine how awesome it would be if all those mailing lists you subscribe to had
a filter looking for “interesting” posts for you so that you didn’t have to read
everything or feel so overloaded that you end up reading nothing.



Wanna help?

If you’re a Java geek, feel free to download the source with Darcs from
http://caterpillar.masukomi.org/code/caterpillar3

Tweak it however you want, add whatever feature you want, use the send feature
of Darcs to send me a patch file (masukomi at masukomi dot org) and I’ll
probably add it in. I figure any forward motion in Caterpillar is good at this
point. The only restriction is that it needs to have a unit test with it. Yes, I
know, it seems hypocritical in light of the utter lack of tests in Caterpillar’s
source, but since I wrote it I got the full-on testing religion so… deal :P

To build Caterpillar, just switch into the build directory and run ant.



Bugs & Feature requests



Report a Caterpillar bug

Request a Caterpillar feature




License & Copyright

The Caterpillar feed aggregator version 3.0 is copyright 2007 Kate Rhodes
(masukomi at masukomi dot org) and is released under the GPL v2.0.
Have fun. Don’t blow anything up. Convince your rich company that they should
buy the source from me so that they can sell it under any license they want. Or
hire me and pay me a decent salary. Either / or…

 
