Spam in the Debian List Archive
Note that all this is very preliminary.
Status quo
It has been claimed that the [http://lists.debian.org/ Debian list archives] contain spam email messages.
There is a "report as spam" button on the list archive page of each message, but at present spam is by and large not removed from the archives. The submissions help (more or less) with finding spam, but they need manual review before they can be acted upon.
Key points towards a spam removal policy
- Messages that are (beyond doubt) spam should be removed from the web archives. They should remain in the mailbox archives (and thus be accessible to developers on master.d.o).
- Spam removals should be very conservative, with any doubt meaning no removal. For systematic removal, candidates need to be checked multiple times in order to minimize the risk of unmerited removal.
- The record of which messages have been flagged as junk, and how that came about (review logs), should be accessible along with the mailbox archives, so that any developer can inspect the changes to the archive and complain to listmaster about removals.
- On the technical side, the URIs of messages must not change when messages are removed from the list archives. To this end, lists.debian.org uses a version of the mhonarc mailbox converter that has been enhanced to allow skipping spam.
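The stable-URI requirement can be illustrated with a minimal sketch. It assumes, purely for illustration, that an archive page is named after the message's position in the mailbox; the msgNNNNN.html pattern, the function name and the sample message ids are assumptions, and this is not the actual mhonarc modification used on lists.debian.org. It is written for a current Python 3 rather than Python 2.5.

```python
def assign_archive_names(messages, spam_ids):
    """Map each non-spam message id to a page name based on its mailbox position.

    The position counter advances for every message, spam included, so a
    removed message leaves a gap instead of renumbering the later ones.
    """
    names = {}
    for position, message_id in enumerate(messages):
        if message_id in spam_ids:
            continue  # skipped in the web archive, but its number stays reserved
        names[message_id] = "msg%05d.html" % position
    return names


# Hypothetical mailbox with one spam message in the middle.
mailbox = ["a@example.org", "spam@example.net", "b@example.org"]
before = assign_archive_names(mailbox, spam_ids=set())
after = assign_archive_names(mailbox, spam_ids={"spam@example.net"})
```

Here the page name of the third message is the same in `before` and `after`, while the spam message simply has no page in `after`.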
Comments and suggestions are very welcome.
Practical matters
About using newspamclassify.py:
- Only tag as spam what is spam beyond any doubt. For example, some people take offense at some comments on the lists, but that does not make those comments spam.
Ideally, multiple people classify the same messages, and when everyone agrees, we can remove them. For this, you send in a signed *.report file. Note that this will be made public (at least to DDs) so that removals can be verified later.
- Inappropriate is the (misguided) term presently used for misguided messages (unsubscribe requests, vacation messages, spam backscatter; probably NOT votes sent to debian-devel instead of devotee). It is not entirely clear what to do with those, but please tag them accordingly.
There are four states to use: spam, non-spam, inappropriate and unsure. There is an internal fifth state, "unchecked", for messages you did not look at.
The program (invoked as ./newspamverify.py list.submission_collection) stores (partial) results in list.submission_collection.report. It resumes from the saved state the next time it is invoked.
Key commands: SNIU for classification; arrow up/down scrolls the message; arrow left/right moves to the next/previous message. Back moves you back to the last message you classified and erases the classification. Quit saves and exits.
- Be aware of bugs. If something goes wrong, the program will try to save a log (with an even longer obscure file name).
Any suggestions on the above and/or the program are of course welcome.
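The "everyone agrees" rule above could be checked along the following lines. This is only a sketch under stated assumptions: the real *.report format is not described here, so each report is modelled as a plain dict from message id to a one-letter state. The letters mirror the SNIU keys, "?" for unchecked is made up, and the `min_reviewers` threshold is an assumption as well. It targets a current Python 3.

```python
# Hypothetical one-letter codes for the four states plus the internal "unchecked".
SPAM, NON_SPAM, INAPPROPRIATE, UNSURE, UNCHECKED = "S", "N", "I", "U", "?"


def removal_candidates(reports, min_reviewers=2):
    """Return ids that enough reviewers looked at and that all of them tagged spam.

    Any verdict other than spam (including unsure) blocks removal, in line
    with the "any doubt means no removal" policy.
    """
    verdicts = {}
    for report in reports:
        for message_id, state in report.items():
            if state != UNCHECKED:  # unchecked entries do not count as a review
                verdicts.setdefault(message_id, []).append(state)
    return sorted(
        message_id
        for message_id, states in verdicts.items()
        if len(states) >= min_reviewers and all(s == SPAM for s in states)
    )


# Two hypothetical reviewers looking at the same four messages.
alice = {"m1": SPAM, "m2": SPAM, "m3": INAPPROPRIATE, "m4": SPAM}
bob = {"m1": SPAM, "m2": UNSURE, "m3": SPAM, "m4": UNCHECKED}
candidates = removal_candidates([alice, bob])
```

Only m1 ends up in `candidates`: m2 and m3 have a dissenting verdict, and m4 was reviewed by just one person.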
Suggested Improvements
- Graphical viewer that essentially renders the web archive html, including the thread links for context (idea by Pabs, errors by TV).
People doing this
If you want to jump in, add yourself here and contact [mailto:tv@beamnet.de Thomas] for coordination. Your help is appreciated.
- bas w.
- giridha (looking at newer d-devel)
- joy (looking at d-project)
- pabs (looking at d-project)
- tale (looking at newer d-devel)
- Thomas Viehmann (involved in listweb, looking at too much spam)
Getting program and data
[http://people.debian.org/~tviehmann/spam/] has a Python (2.5) script and sample data. Don't hesitate to bug me for more.