Differences between revisions 1 and 73 (spanning 72 versions)
Revision 1 as of 2007-11-12 19:45:39
Size: 3684
Editor: ThomasViehmann
Comment: starting spam page
Revision 73 as of 2009-04-28 18:01:40
Size: 5055
Editor: GeoffSimmons
Comment: Add CategoryPermalink.
Line 3: Line 3:
Note that all this is very preliminary. Comments and suggestions are very welcome.
Line 7: Line 7:
It has been claimed that the [http://lists.debian.org/ Debian list archives] contain spam email messages. It has been claimed that the [[http://lists.debian.org/|Debian list archives]] contain spam email messages.
Line 9: Line 9:
There is a "report as spam" button in on the list archive page of each message, but presently, spam is by and large not removed from the archives. The submissions seem to help (more or less) with finding spam but need manual review before they could be acted upon. There is a "report as spam" button in on the list archive page of each message. The submissions seem to help (more or less) with finding spam but need manual review before they could be acted upon.
Line 11: Line 11:
== Key points towards a spam removal policy == == Towards a spam removal policy ==

=== Policy corner stones ===
Line 15: Line 17:
 * The information which messages have been flagged junk and how that came to be (review logs) should be accessible along with the mailbox archives, so any developer can inspect the changes to the archive and complain to listmaster about removals.  * The information which messages have been flagged junk and how that came to be (review logs) should be accessible along with the mailbox archives, so any developer can inspect the changes to the archive and complain to listmaster about removals. This information is currently in http://lists.debian.org/archive-spam-removals/spam-removals/ .
Line 18: Line 20:
Comments and suggestions are very welcome. === Ad hoc policy ===

Review standards should be set after seeing how things pan out.

For the start we accept three undisputed reviewers-ratings. If one reviewer has a different rating, there has to be a stronger majority. As formula:

   it is Spam if $num{spam} > (3+2*$num{ham}+2*$num{inappropriate})

same applies for Ham and Inappropriate.

I hope this would minimize the risk of unwarranted removal. A rigorous standard seems to be necessary to obtain consensus with the project. As such, the three reviewers is only a guideline, not a rule. Of course, more reviewers doing shorter reviews would help tremendously. Ultimately, guaranteeing the integrety of the list archives currently falls in the realm of the Debian listmaster. If the numbers above don't work out, the levels can be changed easily without a hassle, previous blocked content will become available again.
Line 22: Line 34:
About using {{{newspamclassify.py}}}: === discontinued script based effort to flag spam. ===

Description: [[ListMaster/ListArchiveSpam/newspamclassify.py]]

If you have used this one to flag spam, then your work isn't lost. Please send it in, your reports can be converted into the new internal format, so the new method knows which posts have been rated by you.

=== new web-based effort to flag spam. ===
Line 24: Line 43:
 * Ideally, multiple people classify messages and when everyone agrees, we can remove. For this, you send in a signed {{{*.report}}}. Note that this will be made public (at least to DDs) for verification of removals later.
 * Inappropriate is the (misguided) term presently used for misguided (unsubscribe, vac messages, spam backscatter, probably NOT votes sent to debian-devel instead of devotee) messages, is is not entirely clear what to do with those, but please tag them accordingly.
 * There are four states {{{spam, non-spam, inappropriate, unsure}}} to use. There is an internal fifth "unchecked" for things you did not look at.
 * The program (invoked as {{{./newspamverify.py list.submission_collection}}}) and stores (partial) results in {{{list.submission_collection.report}}}. It resumes from the saved state when invoked next time.
 * Key commands: ''SNIU'' for classification, ''arrow up/down'' scrolls message, ''arrow left/right'' next/previous message. ''B''ack moves you back to the last message you classified and erases the classification. ''Q''uit saves and exits.
 * Be aware of bugs, a log will be try to be saved (with an even longer obscure file name) if something goes wrong.
 * Inappropriate is the (misguided) term presently used for misguided ((un)subscribe to list, test messages, replies to Spam messages, vac messages, spam backscatter, probably NOT votes sent to debian-devel instead of devotee) messages, is is not entirely clear what to do with those, but please tag them accordingly.
 * Multiple people need to classify messages and if three more people flag a message as (Spam|Ham|Inappropriate), we can act accordingly. Note that this will be made public (at least to DDs) for verification of removals later.
 * There are four states {{{spam, non-spam, inappropriate, unsure}}} to use.
 * The Webinterface can be found at [[http://lists.debian.org/archive-spam-removals/review/]]. To proceed from that page you need to authorize. For now you need to be a DD, and you need to contact me for a login, Maybe the Authorisation will be later through LDAP.
 * Don't be overwhelmed by the number of articles that are nominated to review. The webinterface shows you 10 randomly chosen of it. And you should never see a rated post again.
Line 33: Line 51:
== Suggested Improvements == == Suggested Improvements and Todos ==
Line 35: Line 53:
 * Graphical viewer that essentially renders the web archive html, including the thread links for context (idea by Pabs, errors by TV).  * Because of the many false positives in the Nominations we need to make sure that
   * Webbots don't press the Spam-Button
     * this needs an analysis of the spam-button-presses.
   * Known good mails (at least the ones tagged 'Not Spam' in the Reviewer Process) should be flagged in the archive so they can not be nominated again.
 * Reworking the 'Report as Spam'-Button, so Nomination Status may be seen.
 * Provide an API to easily report Spam from a MUA to our archive.
 * Integrate the report-as-spam-Mailadress-input for Nominations.
Line 39: Line 63:
If you want to jump in, add yourself here and contact [mailto:tv@beamnet.de Thomas] for coordination. Your help is appreciated. Your help is appreciated.
Line 41: Line 65:
 * bas w.
 * giridha (looking at newer d-devel)
 * joy (looking at d-project)
 * pabs (looking at d-project)
 * tale (looking at newer d-devel)
 * Thomas Viehmann (involved in listweb, looking at too much spam)
Debian Developers need a working @debian.org-address and can start here: [[http://lists.debian.org/archive-spam-removals/review/]].
Line 48: Line 67:
== Getting program and data == non-Debian Developers can help us by pressing the 'Report as Spam'-Button at the archive.
Line 50: Line 69:
 [http://people.debian.org/~tviehmann/spam/] has a python (2.5) script and sample data, don't hesitate to bug me for more
Line 52: Line 70:
CategoryTeams CategoryTeams ## This page is referenced from http://lists.debian.org/archive-spam-removals/review/
CategoryTeams | CategoryPermalink

Spam in the Debian List Archive

Comments and suggestions are very welcome.

Status quo

It has been claimed that the Debian list archives contain spam email messages.

There is a "report as spam" button on the list archive page of each message. The submissions seem to help (more or less) with finding spam, but they need manual review before they can be acted upon.

Towards a spam removal policy

Policy cornerstones

  • Messages that are (beyond doubt) spam should be removed from the web archives. They should remain in the mailbox archives (and thus be accessible to developers on master.d.o).
  • Spam removals should be very conservative, with any doubt meaning no removal. For systematic removal, candidates need to be checked multiple times in order to minimize the risk of unmerited removal.
  • Information about which messages have been flagged as junk, and how that came about (review logs), should be accessible along with the mailbox archives, so any developer can inspect the changes to the archive and complain to listmaster about removals. This information is currently at http://lists.debian.org/archive-spam-removals/spam-removals/.

  • On the technical side, the URIs of messages must not change when messages are removed from the list archives. To this end, lists.debian.org uses a version of the mhonarc mailbox converter that has been enhanced to allow skipping spam.

Ad hoc policy

Review standards should be set after seeing how things pan out.

To start with, we accept three undisputed reviewer ratings. If one reviewer gives a different rating, a stronger majority is required. As a formula:

  • it is Spam if $num{spam} > (3+2*$num{ham}+2*$num{inappropriate})

The same applies to Ham and Inappropriate.
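The rule can be sketched in a few lines of Python. This is only an illustration of the formula as written above, not the code that runs on lists.debian.org; the function name `classify` and the use of plain integer vote counts are assumptions made for the example. Note that with the strict `>` taken literally, three agreeing ratings and no dissent do not yet reach the threshold; a fourth is needed.

```python
def classify(num_spam, num_ham, num_inappropriate):
    """Return the winning label, or None if no label has a strong enough majority.

    Hypothetical helper: applies the page's formula
        count(label) > 3 + 2 * (sum of votes for the other two labels)
    symmetrically to spam, ham and inappropriate.
    """
    votes = {
        "spam": num_spam,
        "ham": num_ham,
        "inappropriate": num_inappropriate,
    }
    for label, count in votes.items():
        # Every dissenting vote raises the bar by two.
        dissent = sum(v for other, v in votes.items() if other != label)
        if count > 3 + 2 * dissent:
            return label
    return None  # undecided: the message stays in the review queue


print(classify(4, 0, 0))  # four undisputed spam ratings
print(classify(5, 1, 0))  # one dissenting vote raises the threshold
```

At most one label can satisfy the inequality for a given set of votes, so the order in which the labels are checked does not matter.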

I hope this will minimize the risk of unwarranted removal. A rigorous standard seems necessary to obtain consensus within the project. As such, the three-reviewer threshold is only a guideline, not a rule. Of course, more reviewers doing shorter reviews would help tremendously. Ultimately, guaranteeing the integrity of the list archives currently falls to the Debian listmaster. If the numbers above don't work out, the levels can be changed easily, and previously blocked content will become available again.

Practical matters

Discontinued script-based effort to flag spam

Description: ListMaster/ListArchiveSpam/newspamclassify.py

If you have used this script to flag spam, your work isn't lost. Please send it in: your reports can be converted into the new internal format, so the new method knows which posts you have already rated.

New web-based effort to flag spam

  • Only tag as spam what is spam beyond any doubt. For example, some people take offense at some comments on the lists, but that is not spam.
  • Inappropriate is the (misguided) term presently used for misguided messages: (un)subscribe requests sent to the list, test messages, replies to spam messages, vacation messages, and spam backscatter, but probably NOT votes sent to debian-devel instead of devotee. It is not entirely clear what to do with those, but please tag them accordingly.
  • Multiple people need to classify messages, and if three more people flag a message as (Spam|Ham|Inappropriate), we can act accordingly. Note that the ratings will be made public (at least to DDs) for later verification of removals.
  • There are four states to use: spam, non-spam, inappropriate, and unsure.

  • The web interface can be found at http://lists.debian.org/archive-spam-removals/review/. To proceed from that page you need to log in. For now you need to be a DD, and you need to contact me for a login. Authorisation may later be handled through LDAP.

  • Don't be overwhelmed by the number of articles nominated for review. The web interface shows you 10 randomly chosen ones at a time, and you should never see a post again once you have rated it.
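The batching behaviour described in the last point could look roughly like the following. This is a guess at the logic for illustration only; the real web interface is not documented here, and the function name `pick_for_review`, its parameters, and the message identifiers are all hypothetical.

```python
import random


def pick_for_review(nominated, already_rated, batch_size=10, rng=random):
    """Pick up to batch_size random nominated posts the reviewer has not rated.

    nominated:      iterable of message identifiers nominated as spam
    already_rated:  set of identifiers this reviewer has rated before
    """
    # Never show a reviewer a post they have already rated.
    candidates = [post for post in nominated if post not in already_rated]
    # random.sample raises ValueError if asked for more items than exist,
    # so cap the batch at the number of remaining candidates.
    return rng.sample(candidates, min(batch_size, len(candidates)))


# Hypothetical identifiers, for the example only.
nominated = [f"msg{i:05d}" for i in range(50)]
rated = set(nominated[:5])
batch = pick_for_review(nominated, rated)
print(len(batch))
```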

Any suggestions on the above and/or the program are of course welcome.

Suggested Improvements and Todos

  • Because of the many false positives in the Nominations we need to make sure that
    • Webbots don't press the Spam-Button
      • this needs an analysis of the spam-button-presses.
    • Known good mails (at least the ones tagged 'Not Spam' in the reviewer process) should be flagged in the archive so they cannot be nominated again.
  • Rework the 'Report as Spam' button so that the nomination status is visible.
  • Provide an API to easily report spam from a MUA to our archive.
  • Integrate the report-as-spam mail address input into the nominations.

People doing this

Your help is appreciated.

Debian Developers need a working @debian.org address and can start here: http://lists.debian.org/archive-spam-removals/review/.

People who are not Debian Developers can help us by pressing the 'Report as Spam' button in the archive.


CategoryTeams | CategoryPermalink