The probabilistically sorted list algorithm, Part 2

I gave the probabilistically sorted list algorithm some thought last night, and this is what I’ve come up with so far:

  • The algorithm must keep track of how often a spamfilter is matched in a given period of time. That’s kind of obvious.
  • As recently matched entries must temporarily be ranked higher than entries that haven’t been recently matched, the algorithm should keep track of two ratios of the type that I mentioned in the first point – a short-term one, and a long-term one.
  • The short-term ratio would measure how many hits there were in the past seven days, and the long-term ratio would be the past 90 days. These time periods are, of course, open for debate. (Perhaps they should be user-configurable, with 7 and 90 being the defaults.)
  • Ratios should be recalculated every X seconds/minutes/hours. A smaller time period would mean more accuracy with the ratios, but could lead to more CPU usage (which is why I haven’t considered a real-time recalculation mechanism, although I could be wrong here), so this should be user-configurable as well. The problem with this, however, is that this value would have to be the same on all servers to prevent desynch issues. Obviously then all servers would calculate the new ratios at exactly the same time provided the server clocks are synchronized. (A side effect of this is that it would further emphasize the importance of keeping the server clocks synchronized.)
  • For every 20 hits that a spamfilter receives in a 60 second window, it must move up 5 places in the rankings. The new rank must be kept until the next ratio recalculation.
  • When ratios are recalculated, all spamfilter ratios with a short-term ratio should be ranked higher than spamfilters without a short-term ratio. (This could be where you, the reader, discover a flaw in my thinking.) Obviously, the greater the ratio, the higher the ranking.
  • Notwithstanding the point above, a spamfilter with only a long-term ratio (or no ratio at all) that suddenly gets triggered must have the ability to move to the #1 spot, regardless of the fact that it may not have had any ratio at the time of the last ratio recalculation
  • This information must be global amongst all servers, kind of like /GLINE lists. Again, that’s also kind of obvious.

As this is the first iteration, there’s bound to be some flaws in my thinking, so any form of contructive criticism will be much appreciated.

Bookmark the permalink.

One Response to The probabilistically sorted list algorithm, Part 2

  1. Loet says:

    Hi Ron

    I could not think how else to contact you, hence the comment to your blog.

    One of our senior developers came across your blog and was impressed by the content and general “attitude”. We are currently looking for a C++ coder and are wondering if your would be interested?

    We are based up in kloof and you can find out a little (and I mean a little) about us at http://www.aat.co.za.

    You can mail me at loet@aat.co.za or call me at 0824684464 if you are intetested to hear more about what we do and why we need a C++ coder.

    I hope to hear from you soon.

    Kind regards

    Loet