Differences between revisions 9 and 10
Revision 9 as of 2014-09-10 13:06:10
Size: 7840
Comment: New UK data retention info
Revision 10 as of 2015-11-17 00:15:03
Size: 7897
Editor: MattTaggart

Many Debian packages produce logs. We'd like to make the logs we produce respect the privacy and reflect the needs of our users.

Logging Guidelines

Proposed guidelines for how Debian packages should behave:


  • by default, don't make any attempt to comply with data retention laws? Put the burden of that on companies who believe they are actually subject to such laws in their jurisdiction

What things do we actually need

To make an abuse report to an ISP, you obviously need to know the IP, and they'll want to know the timestamp and some evidence:

  • for network attacks (DoS), maybe just some packet headers to show type/volume of traffic - may not need to record the payload data at all? this kind of activity is typically followed up immediately or not at all, so no need to log anything like this for more than one working day?
  • for web application attacks, typically a log line from a webserver or IDS. These kinds of logs are often wanted for several days/weeks in case a compromise is not seen right away
  • for incoming email, it typically must have the full message headers, which we don't usually log in full. (So, why log email sender address and IP, if you have it in the message already?)

Some things to avoid logging

  • passwords - are quite often discarded from logs currently
  • IP addresses - if they must be logged, perhaps they can be anonymised (see plugins for Apache etc.); maybe IPv6 addresses could be truncated to a /64 or shorter prefix - particularly to avoid logging EUI-48s (MAC) from SLAAC addresses?
  • MAC addresses - sometimes found in firewall log entries, DHCP logs (even persistently, in the DHCP leases file?); especially with wireless LANs, often uniquely identify devices that were seen on the network and when
  • results of wireless network or Bluetooth scanning? my phone seems to permanently log the latter with a timestamp
  • HTTP_REFERER? think I recently saw something from yahoo.com that, in the query string, identified the cell tower the client's device was associated with?! certainly didn't ask for this data
  • user agent? can be very useful to a web developer, but sometimes can be quite identifying (see https://panopticlick.eff.org )

  • email subject - yes, cPanel/WHM servers do this :/
  • IRC/XMPP/other IM chats... maybe? by default, do not log, or always ask the user if they want to do this?
  • core dumps?
  • file atimes? just because
  • swap space - encrypt with one-time key? because almost nothing we've mentioned so far will be protected by the application with mlock()
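
A minimal sketch in Python of two of the ideas above: truncating IP addresses to a network prefix, and dropping the query string from a referrer before it reaches a log. The function names and the IPv4 prefix length of /24 are assumptions made for illustration; the list above only suggests /64 or shorter for IPv6.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit

# Assumed prefix lengths. /64 follows the suggestion above for IPv6;
# /24 for IPv4 is an illustrative choice, not a recommendation from this page.
V4_PREFIX = 24
V6_PREFIX = 64


def anonymise_ip(addr: str) -> str:
    """Zero the host bits so only the network prefix is logged."""
    ip = ipaddress.ip_address(addr)
    prefix = V4_PREFIX if ip.version == 4 else V6_PREFIX
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def strip_referrer(url: str) -> str:
    """Drop the query string and fragment from a referrer URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

With a /64 prefix the interface identifier of a SLAAC address (where a MAC-derived value would sit) is discarded entirely, which is the point of the suggestion above.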

Developer Resources

If your package produces logs:


  • applications may consider using syslog, instead of their own log files? syslog/logrotate/journald probably have more features, but in particular they allow you to centralise the configuration, such as retention policy
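
As a rough illustration of that suggestion, here is a sketch in Python using only the standard library's logging module. The helper names, the "/dev/log" socket path and the 14-day default are assumptions for the example, not something this page prescribes. The first helper hands records to the local syslog daemon, so retention is decided centrally; the second is a fallback for an application that insists on its own file, with retention bounded by the handler itself.

```python
import logging
import logging.handlers


def syslog_logger(name: str, address="/dev/log") -> logging.Logger:
    """Send records to syslog so the admin controls retention in one place.

    "/dev/log" is assumed to be the local syslog socket (usual on Linux).
    """
    logger = logging.getLogger(name)
    logger.addHandler(logging.handlers.SysLogHandler(address=address))
    return logger


def file_logger(name: str, path: str, keep_days: int = 14) -> logging.Logger:
    """Write to a private log file, rotated daily, keeping keep_days old files."""
    handler = logging.handlers.TimedRotatingFileHandler(
        path,
        when="midnight",        # rotate once per day
        backupCount=keep_days,  # older rotated files are deleted
        delay=True,             # do not create the file until first write
    )
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    return logger
```

The syslog variant is usually preferable on a Debian system, since logrotate or journald settings then apply to every package in the same way.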

Upstream Resources

If you're upstream, how can you make your package meet these same goals:


Reported bugs

Bugs have been reported on those packages to fix this:

  • apache2: 759382 - proposed 7 days retention

  • nginx: 805322 - only keep 14 days of logs

TODO: create a usertag to track those instead of listing them in the wiki here.

Bugs to report

The following packages should be addressed:


  • kismet -- could it be more conservative in what data it stores until you specifically ask for it? does it still default to logging all packets, including data? (isn't this how Google Street View got into trouble for 'accidental' Wi-Fi snooping?)
  • snort -- suspicious packets are captured and saved to disk by default, but if a false positive occurs there could be personal data written, for... how long? (I'm seeing back to 2012 on one system I looked at!)
  • squid3 -- logs clients' web traffic by default, should we really?
  • awstats and similar -- log parsers could mean that IPs, URLs, referrer, user agent etc. are stored indefinitely even after the original log is erased
  • logwatch -- generates email summaries (something permanent) from log files, often includes IPs


  • EFF Best Practices for Online Service Providers (references mostly US law, but the bullet-point general recommendations are universal): https://www.eff.org/wp/osp

EU member countries

(disclaimer: I am not a lawyer)

The EU Data Retention Directive was adopted in 2006; in 2014 it was declared invalid by the EU Court of Justice.

When it was first introduced, some IT admins scrambled to increase the amount of logging they were doing - especially in email services and such (is there an example in the BTS for Exim?) - in case they were (or soon would be) obliged to retain data and be ready to share it with authorities.

Some countries may have put data retention obligations into national law; I'm not sure how the EU ruling affects that.

In the UK, the draft Data Retention and Investigatory Powers Act 2014 clearly intends to work around the court ruling. It would allow a 'public telecommunications operator' to be served with a notice to retain for up to 1 year... (TL;DR) pretty much anything. (The DRIP Act actually says 'subscriber data' and 'traffic data', but references RIPA 21(4) for a definition of the latter. RIPA 21(4) says 'any traffic data comprised in or attached to a communication' so it doesn't seem limited to metadata).

The good news is that, even if this legislation is passed, it seems nobody in the UK is obliged to retain anything until served with a notice from the Secretary of State.

Furthermore, the UK's Data Protection Act requires that personally identifying data only be collected when it's necessary for a particular stated purpose, and that a data subject can request a copy of this data (which sounds like hard work to deal with), or request its deletion. Certainly for businesses, it's a liability to be collecting too much data in case it is compromised. Use this as an argument against management if you're being asked to retain a questionable amount of data.

I recall someone's (Google's?) legal argument that IP addresses in webserver logs are not personally identifying. But if the data being collected under the EU Directive was thought to be useful to law enforcement, then surely it is. We've seen that, with enough data, it eventually becomes personally identifying, especially when cross-referenced with other sources.

In the IPv6 world, SLAAC addresses could include the EUI-48 (MAC) of a device, uniquely identifying someone's phone for example. It will be difficult to argue the Data Protection Act still doesn't apply to IPv6 addresses in logs.

stevenc: personally I suggest the more careful approach of not logging any more than you really want to, in any legal jurisdiction, until you become intimidated into doing so. Take a stand - don't make data retention the norm, so that if someone seeks to put it into law, it will be difficult to enact and rarely complied with in practice. (IMHO the Data Protection Act is almost never complied with).

On your own personal systems I don't see much reason to comply with data retention laws. You, or people you care about, might be the only data subjects. Data coming from your own systems can't be trusted to defend you in court; but it can and will be used to incriminate you (e.g. recent cases of Google search queries used in UK and US courts to infer state of mind). Maybe log to tmpfs, if at all, and/or have logrotate (compress and) encrypt with gpg if you must keep it for a long time.