Wednesday, January 14, 2004
Spidering Hacks

I bought a new book yesterday. It's entitled "Spidering Hacks"
and is published by O'Reilly. Since I have a relatively strong
interest in text classification, I thought using spidering to
extract information from the web would be a natural next step.

I mentioned in an earlier post that I wanted to write a Slashdot
engine that included some of the more advanced features I've been
contemplating. I really like how the Slashdot moderation system
works, but I don't like how much time is currently spent doing
"troll" moderation. There are many worthy posts that never get
marked higher than their initial rating simply because people
spend too much time looking for trolls. Since there is an
extremely large corpus of pre-rated text passages, I think it
should be relatively simple to build a large text cluster and
use an automatic text classifier to bias posts.
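
To make the idea a bit more concrete, here is a minimal sketch of
what such a classifier might look like. It assumes a naive Bayes
model, which is only one possible choice; the labels "worthy" and
"troll", the NaiveBayes class, and the four sample posts are all
made up for illustration and stand in for a real corpus of
moderated comments.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of posts
        self.vocab = set()                       # every word seen in training

    def train(self, text, label):
        """Record one pre-rated post under its label."""
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def log_scores(self, text):
        """Return the log of prior * likelihood for each known label."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            # Smoothing keeps unseen words from zeroing out the product.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, text):
        """Pick the label with the highest score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training data: (post text, label derived from its rating).
rated_posts = [
    ("insightful analysis of the kernel scheduler patch", "worthy"),
    ("good explanation with links to the source code", "worthy"),
    ("first post you all suck", "troll"),
    ("you suck and this site is stupid", "troll"),
]

model = NaiveBayes()
for text, label in rated_posts:
    model.train(text, label)

print(model.classify("a good analysis of the patch"))
print(model.classify("this post is stupid and you suck"))
```

In a real engine the predicted label would probably only nudge a
post's starting score rather than replace human moderation, and
the training set would need to be far larger than four posts.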

We'll see how it works in practice, though. Unfortunately, I just
got buried at work again, but hopefully I'll be able to get
to this soon.

