Differences between revisions 1 and 41 (spanning 40 versions)
Revision 1 as of 2007-11-12 19:45:39
Size: 3684
Editor: ?ThomasViehmann
Comment: starting spam page
Revision 41 as of 2008-03-12 23:07:46
Size: 7113
Editor: ?ThomasViehmann
Comment:
Line 3: Line 3:
Note that all this is very preliminary. Note that all this is very preliminary. Comments and suggestions are very welcome.
Line 11: Line 11:
== Key points towards a spam removal policy == == Towards a spam removal policy ==

=== Policy corner stones ===
Line 15: Line 17:
 * The information which messages have been flagged junk and how that came to be (review logs) should be accessible along with the mailbox archives, so any developer can inspect the changes to the archive and complain to listmaster about removals.  * The information which messages have been flagged junk and how that came to be (review logs) should be accessible along with the mailbox archives, so any developer can inspect the changes to the archive and complain to listmaster about removals. This information is currently in http://liszt.debian.org/~tviehmann/spam-removals/ .
Line 18: Line 20:
Comments and suggestions are very welcome. === Ad hoc policy ===

Review standards should be set after seeing how things pan out. I am aiming at three reviewers, including one experienced one (after some bootstrapping), and hope this will minimize the risk of unwarranted removal. A rigorous standard seems to be necessary to obtain consensus with the project. That said, three reviewers is only a guideline, not a rule. Of course, more reviewers doing shorter reviews would help tremendously. Ultimately, guaranteeing the integrity of the list archives currently falls in the realm of the Debian listmaster.
Line 25: Line 29:
 * Inappropriate is the (misguided) term presently used for misguided (unsubscribe, vac messages, spam backscatter, probably NOT votes sent to debian-devel instead of devotee) messages; it is not entirely clear what to do with those, but please tag them accordingly.  * Inappropriate is the (misguided) term presently used for misguided ((un)subscribe to list, test messages, replies to spam messages, vac messages, spam backscatter, probably NOT votes sent to debian-devel instead of devotee) messages; it is not entirely clear what to do with those, but please tag them accordingly.
Line 27: Line 31:
 * The program (invoked as {{{./newspamverify.py list.submission_collection}}}) and stores (partial) results in {{{list.submission_collection.report}}}. It resumes from the saved state when invoked next time.  * The program (invoked as {{{./newspamverify.py list.submission_collection}}}) stores (partial) results in {{{list.submission_collection.report}}}. It resumes from the saved state when invoked next time.
Line 29: Line 33:
 * Don't be overwhelmed by the size of the collections. They start at recent months and go back in time, so partial results are interesting as well.
   Feedback on how large submission batches you would like to review is welcome.
Line 30: Line 36:
 * Send ''clearsigned'' (using {{{gpg --clearsign}}}) {{{*.report}}} (possibly partial or trimmed with {{{grep -v '^ unchecked;' foo.report > 2foo.report}}} before signing, but don't accidentally erase the log when filtering) to [mailto:tv@beamnet.de Thomas]
Line 36: Line 43:
 * Automate fetching collections and sending back reports in the script.
 * There must be a faster way to nominate spam for review than looking at one article, pressing the Spam-Button and then loading the next one. (Maybe a mutt-macro?)
 * The newspamverify program does not understand MIME mails with base64 or HTML content.
 * Because of the many false positives in the nominations, we need to make sure that
   * Webbots don't press the Spam-Button.
   * We only consider messages with more than one nomination.
   * Known good mails (at least the ones tagged 'Not Spam' in the reviewer process) do not enter the reviewing process again.
Line 39: Line 53:
If you want to jump in, add yourself here and contact [mailto:tv@beamnet.de Thomas] for coordination. Your help is appreciated. If you want to jump in, add yourself here and contact [mailto:tv@beamnet.de Thomas] (tomv_w on IRC) for coordination. Your help is appreciated.
Line 41: Line 55:
 * bas w.
 * giridha (looking at newer d-devel)
 * joy (looking at d-project)
 * pabs (looking at d-project)
=== Works in progress ===

Our goal is to have at least three reports before removing anything. For the following lists, we have some, but not enough, review reports. The people mentioned have already sent in reports. Your help can be most immediately used if you review lists which already have some, but not enough, names listed. Please add your name after you have sent in your report. Lines with no names mean that the report is ready, but no one has worked on it (yet).

||<rowbgcolor="#FFFFE0"> '''List''' || '''1st Report''' || '''2nd Report''' || '''3rd Report''' ||
|| {{{debian-devel}}} || [:Appaji:Y Giridhar Appaji Nag] || || ||
|| {{{debian-www}}} || cord || SandroTosi || ||
|| {{{debian-qa}}} || cord || SandroTosi || ||
|| {{{debian-devel-italian}}} || SandroTosi || || ||
|| {{{debian-l10n-italian}}} || SandroTosi || || ||
|| {{{debian-italian}}} || || || ||
|| {{{debian-python}}} (2nd round) || SandroTosi || || ||
|| {{{debian-amd64}}} || || || ||

=== People ===

 * wijnen (looked at d-project)
 * [:Appaji:Y Giridhar Appaji Nag] (looking at newer d-devel)
 * pabs (looked at d-project)
Line 47: Line 76:
 * Michael Koch/man-di (d-java and d-user-german)

== Success stories ==

 * debian-project has had [http://lists.debian.org/debian-project/2007/11/msg00202.html some spam removed]
 * debian-python: 205/250 submitted messages removed after checking by bzed, SandroTosi, tomv
 * debian-vote: 315 spam messages removed
 * debian-java: cord, man-di, SandroTosi
 * debian-user-german: bzed, cord, man-di
Line 50: Line 88:
 [http://people.debian.org/~tviehmann/spam/] has a python (2.5) script and sample data, don't hesitate to bug me for more  [http://liszt.debian.org/~tviehmann/spam/] has a python (2.5) script and sample data, don't hesitate to bug me (or any listmaster) for more.

 A browser extension is being designed by CyrilBrulebois, so as to display the messages directly from the online archives.
Line 52: Line 93:
CategoryTeams CategoryTeams CategoryTeams

Spam in the Debian List Archive

Note that all this is very preliminary. Comments and suggestions are very welcome.

Status quo

It has been claimed that the [http://lists.debian.org/ Debian list archives] contain spam email messages.

There is a "report as spam" button on the list archive page of each message, but presently spam is by and large not removed from the archives. The submissions seem to help (more or less) with finding spam, but they need manual review before they can be acted upon.

Towards a spam removal policy

Policy corner stones

  • Messages that are (beyond doubt) spam should be removed from the web archives. They should remain in the mailbox archives (and thus be accessible to developers on master.d.o).
  • Spam removals should be very conservative, with any doubt meaning no removal. For systematic removal, candidates need to be checked multiple times in order to minimize the risk of unmerited removal.
  • The information about which messages have been flagged as junk and how that came to be (review logs) should be accessible along with the mailbox archives, so any developer can inspect the changes to the archive and complain to listmaster about removals. This information is currently in http://liszt.debian.org/~tviehmann/spam-removals/ .

  • On the technical side, when messages are removed from the list archives, the URIs of the remaining messages must not change. To this end, lists.debian.org uses a version of the mhonarc mailbox converter that has been enhanced to allow skipping spam.
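
The URI requirement can be illustrated with a small sketch. It is not how the enhanced mhonarc works internally; it only assumes, hypothetically, that page names are derived from a message's position in the mailbox (in the style of the msg00202.html names seen in archive URLs) and that skipped messages keep their slot. The function name and the sample message ids are made up for illustration.

```python
def archive_filenames(message_ids, spam_ids):
    """Map each kept message to a page name derived from its mailbox position.

    Hypothetical model: the page number is the index in the mailbox, so
    skipping a spam message must not shift the numbers of later messages.
    """
    spam = set(spam_ids)
    pages = {}
    for index, message_id in enumerate(message_ids):
        if message_id in spam:
            continue  # no page is written, but the slot stays reserved
        pages[message_id] = "msg%05d.html" % index
    return pages


mailbox = ["a@example.org", "spam@example.org", "b@example.org"]
before = archive_filenames(mailbox, [])
after = archive_filenames(mailbox, ["spam@example.org"])
# Every message that survives keeps the page name it had before the removal.
unchanged = all(after[m] == before[m] for m in after)
```

Renumbering the remaining messages after a removal would be the simpler implementation, but it would break every existing link into the archive.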

Ad hoc policy

Review standards should be set after seeing how things pan out. I am aiming at three reviewers, including one experienced one (after some bootstrapping), and hope this will minimize the risk of unwarranted removal. A rigorous standard seems to be necessary to obtain consensus with the project. That said, three reviewers is only a guideline, not a rule. Of course, more reviewers doing shorter reviews would help tremendously. Ultimately, guaranteeing the integrity of the list archives currently falls in the realm of the Debian listmaster.
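
As a rough sketch of what this guideline could look like in code, the following assumes each reviewer's report has already been parsed into a dictionary mapping a message id to a verdict. That data layout, the function names and the verdict strings are assumptions for illustration; they are not taken from newspamverify.py.

```python
SPAM = "spam"
UNCHECKED = "unchecked"


def may_remove(verdicts, min_reports=3):
    """True only if enough reviewers looked at the message and all said spam."""
    return len(verdicts) >= min_reports and all(v == SPAM for v in verdicts)


def removable(reports, min_reports=3):
    """Return the ids of messages that every reviewer classified as spam.

    reports: one dict per reviewer, mapping message id -> verdict (assumed format).
    """
    collected = {}
    for report in reports:
        for message_id, verdict in report.items():
            if verdict != UNCHECKED:  # unchecked entries are not a review
                collected.setdefault(message_id, []).append(verdict)
    return sorted(m for m, v in collected.items() if may_remove(v, min_reports))
```

Any single dissenting or unsure verdict blocks removal here, which matches the "any doubt means no removal" rule. Since three reviewers is a guideline rather than a rule, the threshold is a parameter.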

Practical matters

About using newspamclassify.py:

  • Only tag as spam what really is absolutely surely spam. For example, some people take offense at comments on the lists, but that does not make those comments spam.
  • Ideally, multiple people classify the messages, and when everyone agrees, we can remove them. For this, you send in a signed *.report. Note that this will be made public (at least to DDs) for verification of removals later.

  • Inappropriate is the (misguided) term presently used for misguided messages ((un)subscribe to list, test messages, replies to spam messages, vac messages, spam backscatter, probably NOT votes sent to debian-devel instead of devotee). It is not entirely clear what to do with those, but please tag them accordingly.
  • There are four states to use: spam, non-spam, inappropriate and unsure. There is an internal fifth state, "unchecked", for messages you did not look at.

  • The program (invoked as ./newspamverify.py list.submission_collection) stores (partial) results in list.submission_collection.report. It resumes from the saved state when invoked next time.

  • Key commands: S, N, I and U classify the current message (spam, non-spam, inappropriate, unsure). Arrow up/down scrolls the message; arrow left/right moves to the next/previous message. Back returns to the last message you classified and erases its classification. Quit saves and exits.

  • Don't be overwhelmed by the size of the collections. They start at recent months and go back in time, so partial results are interesting as well.
    • Feedback on what size of submission batches you would like to review is welcome.
  • Be aware of bugs. If something goes wrong, the program tries to save a log (under an even longer, obscure file name).
  • Send your *.report, clearsigned with gpg --clearsign, to [mailto:tv@beamnet.de Thomas]. The report may be partial, or trimmed before signing with grep -v '^  unchecked;' foo.report > 2foo.report, but don't accidentally erase the log when filtering.
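
For those who prefer not to use grep, the trimming step can be sketched in Python. The only thing assumed about the report format is what the grep pattern above implies: unchecked entries are lines that start with spaces followed by "unchecked;". The function names are hypothetical, and the pattern accepts one or more leading spaces, which is slightly more lenient than the grep call.

```python
import re

# Lines for messages that were never looked at (format inferred from the grep pattern).
UNCHECKED_LINE = re.compile(r"^ +unchecked;")


def trim_report(lines):
    """Drop unchecked entries; keep every other line, including any log lines."""
    return [line for line in lines if not UNCHECKED_LINE.match(line)]


def trim_report_file(source, target):
    """Write a trimmed copy of source to target without touching the original."""
    if source == target:
        raise ValueError("refusing to overwrite the original report")
    with open(source) as infile:
        lines = infile.readlines()
    with open(target, "w") as outfile:
        outfile.writelines(trim_report(lines))
```

For example, trim_report_file("foo.report", "2foo.report") would produce the file to sign. Writing to a separate file means the original report, with whatever log it contains, is still there if the filter turns out to remove too much.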

Any suggestions on the above and/or the program are of course welcome.

Suggested Improvements

  • Graphical viewer that essentially renders the web archive html, including the thread links for context (idea by Pabs, errors by TV).
  • Automate fetching collections and sending back reports in the script.
  • There must be a faster way to nominate spam for review than looking at one article, pressing the Spam-Button and then loading the next one. (Maybe a mutt-macro?)
  • The newspamverify program does not understand MIME mails with base64 or HTML content.
  • Because of the many false positives in the nominations, we need to make sure that
    • Webbots don't press the Spam-Button.
    • We only consider messages with more than one nomination.
    • Known good mails (at least the ones tagged 'Not Spam' in the reviewer process) do not enter the reviewing process again.
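
The last two points could be sketched as a pre-filter on the nominations. The input format is an assumption: a flat list with one message id per press of the Spam-Button, plus a collection of ids already tagged as not spam. Nothing here addresses web bots, which would need information about who nominated a message.

```python
from collections import Counter


def review_candidates(nominations, known_good=(), min_nominations=2):
    """Select messages worth reviewing from raw spam nominations.

    nominations: one message id per nomination (assumed format).
    known_good: ids already tagged 'Not Spam' in an earlier review.
    """
    counts = Counter(nominations)
    good = set(known_good)
    return sorted(
        message_id
        for message_id, count in counts.items()
        if count >= min_nominations and message_id not in good
    )
```

With nominations ["a", "b", "a", "c", "c", "c"] and "c" known to be good, only "a" would be passed on to the reviewers.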

People doing this

If you want to jump in, add yourself here and contact [mailto:tv@beamnet.de Thomas] (tomv_w on IRC) for coordination. Your help is appreciated.

Works in progress

Our goal is to have at least three reports before removing anything. For the following lists, we have some, but not enough, review reports. The people mentioned have already sent in reports. Your help can be most immediately used if you review lists which already have some, but not enough, names listed. Please add your name after you have sent in your report. Lines with no names mean that the report is ready, but no one has worked on it (yet).

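|| List || 1st Report || 2nd Report || 3rd Report ||
|| debian-devel || [:Appaji:Y Giridhar Appaji Nag] || || ||
|| debian-www || cord || SandroTosi || ||
|| debian-qa || cord || SandroTosi || ||
|| debian-devel-italian || SandroTosi || || ||
|| debian-l10n-italian || SandroTosi || || ||
|| debian-italian || || || ||
|| debian-python (2nd round) || SandroTosi || || ||
|| debian-amd64 || || || ||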

People

  • wijnen (looked at d-project)
  • [:Appaji:Y Giridhar Appaji Nag] (looking at newer d-devel)

  • pabs (looked at d-project)
  • tale (looking at newer d-devel)
  • Thomas Viehmann (involved in listweb, looking at too much spam)
  • Michael Koch/man-di (d-java and d-user-german)

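Success stories

  • debian-project has had [http://lists.debian.org/debian-project/2007/11/msg00202.html some spam removed]
  • debian-python: 205/250 submitted messages removed after checking by bzed, SandroTosi, tomv
  • debian-vote: 315 spam messages removed
  • debian-java: cord, man-di, SandroTosi
  • debian-user-german: bzed, cord, man-di

Getting program and data

[http://liszt.debian.org/~tviehmann/spam/] has a python (2.5) script and sample data. Don't hesitate to bug me (or any listmaster) for more.

A browser extension is being designed by CyrilBrulebois, so as to display the messages directly from the online archives.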


CategoryTeams