About using newspamclassify.py:
- Only tag as spam what is, beyond any doubt, spam. For example, some people take offense at some comments on the lists, but that is not spam.
Ideally, multiple people classify the messages, and when everyone agrees, we can remove them. For this, you send in a signed *.report. Note that this will be made public (at least to DDs) for verification of removals later.
- Inappropriate is the (misguided) term presently used for misguided messages: (un)subscribe requests sent to the list, test messages, replies to spam messages, vacation messages, spam backscatter (but probably NOT votes sent to debian-devel instead of devotee). It is not entirely clear what to do with those, but please tag them accordingly.
There are four states to use: spam, non-spam, inappropriate, and unsure. There is an internal fifth state, "unchecked", for messages you did not look at.
The program (invoked as ./newspamverify.py list.submission_collection) stores (partial) results in list.submission_collection.report. It resumes from the saved state when invoked next time.
Key commands: SNIU for classification, arrow up/down scrolls the message, arrow left/right moves to the next/previous message. Back moves you back to the last message you classified and erases its classification. Quit saves and exits.
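As an illustration of the four states and the SNIU keys, here is a minimal sketch. It assumes that each key is the initial letter of its state; the names `STATES`, `DEFAULT_STATE` and `classify` are hypothetical and not taken from the actual script.

```python
# Hypothetical sketch: names and the key-to-state mapping are assumptions,
# not copied from the real script.
STATES = {
    "S": "spam",
    "N": "non-spam",
    "I": "inappropriate",
    "U": "unsure",
}

# Internal fifth state for messages that were not looked at yet.
DEFAULT_STATE = "unchecked"


def classify(key, current=DEFAULT_STATE):
    """Return the state selected by a key press, or keep the current state."""
    return STATES.get(key.upper(), current)
```

Any key that is not one of the four letters leaves the message in its current state, so an unseen message stays "unchecked".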
- Don't be overwhelmed by the size of the collections. They start at recent months and go back in time, so partial results are interesting as well.
- Feedback on how large submission batches you would like to review is welcome.
- Be aware of bugs: if something goes wrong, the program will try to save a log (with an even longer, more obscure file name).
Send the clearsigned *.report (using gpg --clearsign) to (currently vacant). The report may be partial, or trimmed with grep -v '^ unchecked;' foo.report > 2foo.report before signing, but don't accidentally erase the log when filtering.
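If you prefer to trim the report from Python instead of grep, the following sketch does the same filtering. It only assumes what the grep pattern implies, namely that unchecked entries are lines starting with " unchecked;"; the function names are hypothetical.

```python
def drop_unchecked(lines):
    """Keep every line except those starting with ' unchecked;'.

    This mirrors: grep -v '^ unchecked;'
    """
    return [line for line in lines if not line.startswith(" unchecked;")]


def trim_report(src_path, dst_path):
    """Write a trimmed copy of src_path to dst_path and return the line count."""
    if src_path == dst_path:
        # Refuse to overwrite the input, so nothing is erased by accident.
        raise ValueError("source and destination must differ")
    with open(src_path) as src:
        kept = drop_unchecked(src.readlines())
    with open(dst_path, "w") as dst:
        dst.writelines(kept)
    return len(kept)
```

For example, `trim_report("foo.report", "2foo.report")` leaves foo.report untouched and writes the trimmed copy, which you can then sign.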
Any suggestions on the above and/or the program are of course welcome.
Suggested Improvements
- Graphical viewer that essentially renders the web archive html, including the thread links for context (idea by Pabs, errors by TV).
- Automate fetching the collections and sending back the reports in the script.
- There must be a faster way to nominate Spam for reviewing than looking at one article, pressing the Spam-Button and then loading the next. (Maybe a mutt-macro?)
- The newspamverify program doesn't understand MIME mails with base64 or HTML content.
- Because of the many false positives in the nominations, we need to make sure that:
  - Webbots don't press the Spam-Button.
  - We only consider messages with more than one nomination.
- Known good mails (at least the ones tagged 'Not Spam' in the reviewer process) should not enter the reviewing process again.
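The last two suggestions (more than one nomination, and skipping known good mails) could be sketched roughly as follows. This is only an illustration: the function name, its parameters, and the idea of identifying messages by an id string are assumptions, and it counts nominations, not distinct nominators.

```python
from collections import Counter


def candidates_for_review(nominations, known_good=frozenset(), minimum=2):
    """Return message ids nominated at least `minimum` times.

    nominations: iterable of message ids, one entry per nomination (assumed).
    known_good:  ids already tagged 'Not Spam' by reviewers (assumed).
    """
    counts = Counter(nominations)
    return sorted(
        msg_id
        for msg_id, n in counts.items()
        if n >= minimum and msg_id not in known_good
    )
```

With `minimum=2`, a message nominated only once never reaches the reviewers, which should reduce the number of false positives.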
People doing this
If you want to jump in, add yourself here and contact CordBeermann for coordination. Your help is appreciated.
Works in progress
Our goal is to have at least three reports before removing anything. For the following lists, we have some, but not enough, review reports. The people mentioned have already sent in reports. Your help is most immediately useful if you review lists which already have some, but not enough, names listed. Please add your name after you have sent in your report. Lines with no names mean that the report is ready, but no one has worked on it (yet).
List | 1st Report | 2nd Report | 3rd Report
debian-devel | | |
debian-devel-italian | | |
debian-l10n-italian | | |
debian-italian | | |
debian-python (2nd round) | | |
debian-amd64 | | |
debian-security | | |
debian-68k | SandroTosi (ready-to-report) | |
debian-accessibility | | |
debian-alpha | SandroTosi (ready-to-report) | |
debian-apache | SandroTosi (ready-to-report) | |
debian-arm | SandroTosi (ready-to-report) | |
debian-firewall | SandroTosi (ready-to-report) | |
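The rule stated at the top of this section (at least three reports, and everyone agrees) could be checked along these lines. This is a sketch under assumptions: each report is taken to be already parsed into a dict mapping a message id to its state, and the function name `removable` is hypothetical.

```python
def removable(reports, required=3):
    """Return ids that at least `required` reviewers checked, all as spam.

    reports: list of dicts mapping message id -> state (assumed shape).
    """
    verdicts = {}
    for report in reports:
        for msg_id, state in report.items():
            # "unchecked" means the reviewer never looked at the message.
            if state != "unchecked":
                verdicts.setdefault(msg_id, []).append(state)
    return sorted(
        msg_id
        for msg_id, states in verdicts.items()
        if len(states) >= required and all(s == "spam" for s in states)
    )
```

A single "non-spam", "inappropriate" or "unsure" verdict keeps a message out of the result, so only unanimous spam classifications are proposed for removal.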
People
- wijnen (looked at d-project)
- Y Giridhar Appaji Nag (looking at newer d-devel)
- pabs (looked at d-project)
- tale (looking at newer d-devel)
- Michael Koch/man-di (d-java and d-user-german)
Success stories
List | Stats: reported/spam_removed | Thanks go to...
debian-project | | hecker, pabs, tviehmann, wijnen
debian-python | | bzed, SandroTosi, tomv
debian-vote | 315 spam messages removed |
debian-java | | CordBeermann, man-di, SandroTosi
debian-user-german | | bzed, CordBeermann, man-di
debian-release | |
debian-qa | |
debian-newmaint | |
debian-www | |
Getting program and data
http://lists.debian.org/archive-spam-removals/spam/ has a Python (2.5) script and sample data; don't hesitate to bug me (or any listmaster) for more.
A browser extension is being designed by CyrilBrulebois, so as to display the messages directly from the online archives.