Differences between revisions 42 and 100 (spanning 58 versions)
Revision 42 as of 2008-04-01 07:49:08
Size: 7162
Editor: SandroTosi
Comment:
Revision 100 as of 2021-03-25 01:40:55
Size: 6754
Editor: PaulWise
Comment: linkify the review interface
Line 3: Line 3:
Note that all this is very preliminary. Comments and suggestions are very welcome. <<TableOfContents>>
Line 5: Line 5:
== Status quo == == Problem ==
Line 7: Line 7:
It has been claimed that the [http://lists.debian.org/ Debian list archives] contain spam email messages. The Debian mailing lists block a lot of spam, but occasionally small numbers of messages get through and so the [[https://lists.debian.org/|Debian list archives]] contain some spam email messages.
Line 9: Line 9:
There is a "report as spam" button in on the list archive page of each message, but presently, spam is by and large not removed from the archives. The submissions seem to help (more or less) with finding spam but need manual review before they could be acted upon. == Solution ==
Line 11: Line 11:
== Towards a spam removal policy == Mailing list subscribers and visitors to the web archive [[#nominate|nominate]] messages that they consider to be spam. Then Debian members [[#review|review]] the nominated messages and classify them according to the [[#policy|policy]]. Then the mailing list archives remove the messages from indexes and prevent access to the messages themselves.

<<Anchor(nominate)>>
== Nominating a message as spam ==

There are several options for nominating messages as spam:

 * Press the 'Report as Spam' button on spam messages in the web archives.
 * Submit the Message-Id for spam messages to the [[https://lists.debian.org/cgi-bin/nominate-for-review.pl|spam nomination web form]]
 * Use one of the [[/MUAPlugins|spam nomination mail client plugins]] to report spam messages
 * Use your mail client's bounce/resend/redirect functionality to send the spam messages to report-listspam@lists.debian.org

<<Anchor(policy)>>
== Spam removal policy ==
Line 17: Line 30:
 * The information which messages have been flagged junk and how that came to be (review logs) should be accessible along with the mailbox archives, so any developer can inspect the changes to the archive and complain to listmaster about removals. This information is currently in http://liszt.debian.org/~tviehmann/spam-removals/ .  * The information which messages have been flagged junk and how that came to be (review logs) should be accessible along with the mailbox archives, so any developer can inspect the changes to the archive and complain to listmaster about removals. This information is currently in https://lists.debian.org/archive-spam-removals/spam-removals/ .
Line 22: Line 35:
Review standards should be set after seeing how things pan out, I am aiming at three reviewers, including one experienced one (after some bootstrapping). I hope this would minimize the risk of unwarranted removal. A rigorous standard seems to be necessary to obtain consensus with the project. As such, the three reviewers is only a guideline, not a rule. Of course, more reviewers doing shorter reviews would help tremendously. Ultimately, guaranteeing the integrity of the list archives currently falls in the realm of the Debian listmaster. Review standards should be set after seeing how things pan out.

For the start we accept three undisputed reviewers-ratings. If one reviewer has a different rating, there has to be a stronger majority. As formula:

   it is Spam if $num{spam} > (3+2*$num{ham}+2*$num{inappropriate})

same applies for Ham and Inappropriate.

I hope this will minimize the risk of unwarranted removal. A rigorous standard seems to be necessary to obtain consensus within the project. As such, the figure of three reviewers is only a guideline, not a rule. Of course, more reviewers doing shorter reviews would help tremendously. Ultimately, guaranteeing the integrity of the list archives currently falls in the realm of the Debian listmaster. If the numbers above don't work out, the thresholds can be changed easily, and previously blocked content will become available again.
Line 26: Line 47:
About using {{{newspamclassify.py}}}: <<Anchor(review)>>
=== Spam nomination review ===
Line 28: Line 51:
 * Ideally, multiple people classify messages and when everyone agrees, we can remove. For this, you send in a signed {{{*.report}}}. Note that this will be made public (at least to DDs) for verification of removals later.
 * Inappropriate is the (misguided) term presently used for misguided ((un)subscribe to list, test messages, replies to spam messages, vac messages, spam backscatter, probably NOT votes sent to debian-devel instead of devotee) messages, it is not entirely clear what to do with those, but please tag them accordingly.
 * There are four states {{{spam, non-spam, inappropriate, unsure}}} to use. There is an internal fifth "unchecked" for things you did not look at.
 * The program (invoked as {{{./newspamverify.py list.submission_collection}}}) stores (partial) results in {{{list.submission_collection.report}}}. It resumes from the saved state when invoked next time.
 * Key commands: ''SNIU'' for classification, ''arrow up/down'' scrolls message, ''arrow left/right'' next/previous message. ''B''ack moves you back to the last message you classified and erases the classification. ''Q''uit saves and exits.
 * Don't be overwhelmed by the size of the collections. They start at recent months and go back in time, so partial results are interesting as well.
   Feedback on how large submission batches you would like to review is welcome.
 * Be aware of bugs; the program will try to save a log (with an even longer, obscure file name) if something goes wrong.
 * Send ''clearsigned'' (using {{{gpg --clearsign}}}) {{{*.report}}} (possibly partial or trimmed with {{{grep -v '^ unchecked;' foo.report > 2foo.report}}} before signing, but don't accidentally erase the log when filtering) to [mailto:tv@beamnet.de Thomas]
 * Inappropriate is the (misguided) term presently used for misdirected messages: (un)subscribe requests sent to the list, test messages, replies to spam messages, vacation messages and spam backscatter, but probably NOT votes sent to debian-devel instead of devotee. It is not entirely clear what to do with those, but please tag them accordingly.
 * Multiple people need to classify messages and if three more people flag a message as (Spam|Ham|Inappropriate), we can act accordingly. Note that these classifications will be made public (at least to DDs) so that removals can be verified later.
 * There are four states to choose from: {{{spam, non-spam, inappropriate, unsure}}}.
 * The web interface can be found at [[https://lists.debian.org/archive-spam-removals/review/]]. To proceed from that page you need to log in. For now you need to be a DD and you need to contact me for a login; authorisation may later go through LDAP.
 * Don't be overwhelmed by the number of articles nominated for review. The web interface shows you 10 randomly chosen ones, and you should never see a post again once you have rated it.
Line 40: Line 59:
== Suggested Improvements == == Suggested Improvements and Todos ==
Line 42: Line 61:
 * Graphical viewer that essentially renders the web archive html, including the thread links for context (idea by Pabs, errors by TV).
 * Have getting of collections and sending back reports automated in the script.
 * There must be a faster way to nominate Spam for reviewing than looking at one article, pressing the Spam-Button and then load the next. (Maybe a mutt-macro?)
 * The newspamverify-program doesn't understand MIME-Mails with base64 or HTML
Line 48: Line 63:
   * We only consider nominations with more than one nomination.
   * Known good mails (at least the ones tagged 'Not Spam' in the Reviewer Process) should not be get in the reviewing process again.
     * this needs an analysis of the spam-button-presses.
   * Known good mails (at least the ones tagged 'Not Spam' in the Reviewer Process) should be flagged in the archive so they can not be nominated again.
 * Reworking the 'Report as Spam'-Button, so Nomination Status may be seen.
 * Analyze Logfiles to identify bots pressing the 'Report as Spam'-Button.
 * Check if some Meta-Tags are helpful to steer bots to the right pages.
Line 53: Line 71:
If you want to jump in, add yourself here and contact [mailto:tv@beamnet.de Thomas] (tomv_w on IRC) for coordination. Your help is appreciated. Your help is appreciated.
Line 55: Line 73:
=== Works in progress === Debian members need a working @debian.org-address and can use the [[https://lists.debian.org/archive-spam-removals/review/|review web interface]].
Line 57: Line 75:
Our goal is to have at least three reports before removing anything. For the following lists, we have some, but not enough review reports. The people mentioned already sent in reports. Your help can most immediately used if you review lists which already have some, but not enough names listed. Please add your name after you sent in your report. Lines with no names mean that the report is ready, but no one have elaborated it (yet). Everyone else can help us by [[#nominate|nominating]] spam email messages.
Line 59: Line 77:
||<rowbgcolor="#FFFFE0"> '''List''' || '''1st Report''' || '''2nd Report''' || '''3rd Report''' ||
|| {{{debian-devel}}} || [:Appaji:Y Giridhar Appaji Nag] || || ||
|| {{{debian-www}}} || cord || SandroTosi || ||
|| {{{debian-qa}}} || cord || SandroTosi || ||
|| {{{debian-devel-italian}}} || SandroTosi || || ||
|| {{{debian-l10n-italian}}} || SandroTosi || || ||
|| {{{debian-italian}}} || SandroTosi || || ||
|| {{{debian-python}}} (2nd round) || SandroTosi || || ||
|| {{{debian-amd64}}} || || || ||
|| {{{debian-newmaint}}} || || || ||

=== People ===

 * wijnen (looked at d-project)
 * [:Appaji:Y Giridhar Appaji Nag] (looking at newer d-devel)
 * pabs (looked at d-project)
 * tale (looking at newer d-devel)
 * Thomas Viehmann (involved in listweb, looking at too much spam)
 * Michael Koch/man-di (d-java and d-user-german)

== Success stories ==

 * debian-project has had [http://lists.debian.org/debian-project/2007/11/msg00202.html some spam removed]
 * debian-python: 205/250 submitted messages removed after checking by bzed, SandroTosi, tomv
 * debian-vote: 315 spam messages removed
 * debian-java: cord, man-di, SandroTosi
 * debian-user-german: bzed, cord, man-di

== Getting program and data ==

 [http://liszt.debian.org/~tviehmann/spam/] has a python (2.5) script and sample data, don't hesitate to bug me (or any listmaster) for more.

 A browser extension is being designed by CyrilBrulebois, so as to display the messages directly from the online archives.
Some coordinated efforts for specific lists are currently being run:
 * [[DebianInstaller/SpamClean|Cleaning of debian-boot]] by the DebianInstaller team
 * [[I18n/FrenchSpamClean|Cleaning of -french lists]] by the French-speaking community ([[https://lists.debian.org/debian-devel-french/2011/01/msg00000.html|Announce mail in french]])
 * [[I18n/ItalianSpamClean|Cleaning of -italian lists]] by the Italian community ([[https://lists.debian.org/msgid-search/4CEB7C6B.6030408%40debian.org|Announce mail in Italian]])
 * [[KdeSpamClean|Cleaning of -qt-kde and -kde lists]]
 * [[Teams/Webmaster/SpamClean|Cleaning of debian-www list]]
 * [[Teams/Publicity/SpamClean|Cleaning of debian-publicity list]]
 * [[I18n/SpanishSpamClean|Cleaning of Debian lists in Spanish]] by the Spanish-speaking community ([[https://lists.debian.org/msgid-search/20110903234408.GA27748@camaleonina.octanux.com|Announce mail in Spanish]])
 * [[DebianWomen/ListSpamCleaning|Cleaning of debian-women list]]
 * [[I18n/CatalanSpamClean|Cleaning of -catalan lists]] by the Catalan-speaking community ([[https://lists.debian.org/debian-user-catalan/2012/03/msg00012.html|Announce mail in Catalan]])
Line 94: Line 89:
CategoryTeams ## This page is referenced from https://lists.debian.org/archive-spam-removals/review/
CategoryTeams | CategoryPermalink | CategoryListArchiveSpam

Spam in the Debian List Archive

Problem

The Debian mailing lists block a lot of spam, but occasionally small numbers of messages get through and so the Debian list archives contain some spam email messages.

Solution

Mailing list subscribers and visitors to the web archive nominate messages that they consider to be spam. Then Debian members review the nominated messages and classify them according to the policy. Then the mailing list archives remove the messages from indexes and prevent access to the messages themselves.

Nominating a message as spam

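There are several options for nominating messages as spam:

  • Press the 'Report as Spam' button on spam messages in the web archives.
  • Submit the Message-Id of a spam message to the spam nomination web form at https://lists.debian.org/cgi-bin/nominate-for-review.pl
  • Use one of the spam nomination mail client plugins to report spam messages.
  • Use your mail client's bounce/resend/redirect functionality to send the spam messages to report-listspam@lists.debian.org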

Spam removal policy

Policy cornerstones

  • Messages that are (beyond doubt) spam should be removed from the web archives. They should remain in the mailbox archives (and thus be accessible to developers on master.d.o).
  • Spam removals should be very conservative, with any doubt meaning no removal. For systematic removal, candidates need to be checked multiple times in order to minimize the risk of unmerited removal.
  • Information about which messages have been flagged as junk, and how that came to be (the review logs), should be accessible along with the mailbox archives, so that any developer can inspect the changes to the archive and complain to listmaster about removals. This information is currently at https://lists.debian.org/archive-spam-removals/spam-removals/ .

  • On the technical side, the URIs of messages must not change when messages are removed from the list archives. To this end, lists.debian.org uses a version of the mhonarc mailbox converter that has been enhanced to allow skipping spam.

Ad hoc policy

Review standards should be set after seeing how things pan out.

To start with, we accept three undisputed reviewer ratings. If one reviewer gives a different rating, a stronger majority is required. As a formula:

  • it is Spam if $num{spam} > (3+2*$num{ham}+2*$num{inappropriate})

The same applies to Ham and Inappropriate.
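The rule above can be sketched in Python as follows. This is only an illustration, not the code that runs on lists.debian.org: the function name `verdict`, the parameters `base` and `weight`, and the choice to ignore "unsure" ratings are assumptions made for the example. Read literally, the strict inequality means a category needs four ratings when nobody disagrees, so three matching ratings alone do not yet cross the threshold; whether `>=` was intended is not stated here.

```python
from typing import Mapping, Optional

# The three categories named in the policy formula.
CATEGORIES = ("spam", "ham", "inappropriate")


def verdict(num: Mapping[str, int], base: int = 3, weight: int = 2) -> Optional[str]:
    """Return the category that wins under the policy formula, or None.

    A category wins if its count is strictly greater than
    base + weight * (sum of the counts of the other two categories).
    Ratings under any other key (for example "unsure") are ignored;
    that is an assumption of this sketch.
    """
    for category in CATEGORIES:
        others = sum(num.get(c, 0) for c in CATEGORIES if c != category)
        if num.get(category, 0) > base + weight * others:
            return category
    # No category has a strong enough majority yet.
    return None


if __name__ == "__main__":
    print(verdict({"spam": 4}))             # spam: 4 > 3
    print(verdict({"spam": 3}))             # None: 3 is not > 3
    print(verdict({"spam": 5, "ham": 1}))   # None: 5 is not > 3 + 2*1
    print(verdict({"spam": 6, "ham": 1}))   # spam: 6 > 5
```

Because each dissenting rating raises the threshold by two, at most one category can ever satisfy the inequality, so the order in which the categories are checked does not matter.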

I hope this will minimize the risk of unwarranted removal. A rigorous standard seems to be necessary to obtain consensus within the project. As such, the figure of three reviewers is only a guideline, not a rule. Of course, more reviewers doing shorter reviews would help tremendously. Ultimately, guaranteeing the integrity of the list archives currently falls in the realm of the Debian listmaster. If the numbers above don't work out, the thresholds can be changed easily, and previously blocked content will become available again.

Practical matters

Spam nomination review

  • Only tag as spam what is certainly spam. For example, some people take offense at comments on the lists, but an offensive comment is not spam.
  • Inappropriate is the (misguided) term presently used for misdirected messages: (un)subscribe requests sent to the list, test messages, replies to spam messages, vacation messages and spam backscatter, but probably NOT votes sent to debian-devel instead of devotee. It is not entirely clear what to do with those, but please tag them accordingly.
  • Multiple people need to classify messages and if three more people flag a message as (Spam|Ham|Inappropriate), we can act accordingly. Note that these classifications will be made public (at least to DDs) so that removals can be verified later.
  • There are four states to choose from: spam, non-spam, inappropriate, unsure.

  • The web interface can be found at https://lists.debian.org/archive-spam-removals/review/ . To proceed from that page you need to log in. For now you need to be a DD and you need to contact me for a login; authorisation may later go through LDAP.

  • Don't be overwhelmed by the number of articles nominated for review. The web interface shows you 10 randomly chosen ones, and you should never see a post again once you have rated it.

Any suggestions on the above and/or the program are of course welcome.

Suggested Improvements and Todos

  • Because of the many false positives among the nominations, we need to make sure that
    • web bots don't press the spam button
      • this needs an analysis of the spam-button presses.
    • known good mails (at least the ones tagged 'Not Spam' in the review process) are flagged in the archive so they cannot be nominated again.
  • Rework the 'Report as Spam' button so that the nomination status is visible.
  • Analyze log files to identify bots pressing the 'Report as Spam' button.
  • Check whether some meta tags would help steer bots to the right pages.

People doing this

Your help is appreciated.

Debian members need a working @debian.org address and can use the review web interface.

Everyone else can help us by nominating spam email messages.

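Some coordinated efforts for specific lists are currently being run:

  • Cleaning of debian-boot by the DebianInstaller team
  • Cleaning of -french lists by the French-speaking community (announcement mail in French: https://lists.debian.org/debian-devel-french/2011/01/msg00000.html)
  • Cleaning of -italian lists by the Italian community (announcement mail in Italian: https://lists.debian.org/msgid-search/4CEB7C6B.6030408%40debian.org)
  • Cleaning of -qt-kde and -kde lists
  • Cleaning of debian-www list
  • Cleaning of debian-publicity list
  • Cleaning of Debian lists in Spanish by the Spanish-speaking community (announcement mail in Spanish: https://lists.debian.org/msgid-search/20110903234408.GA27748@camaleonina.octanux.com)
  • Cleaning of debian-women list
  • Cleaning of -catalan lists by the Catalan-speaking community (announcement mail in Catalan: https://lists.debian.org/debian-user-catalan/2012/03/msg00012.html)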


CategoryTeams | CategoryPermalink | CategoryListArchiveSpam