#687 new defect

too many "false alarms" in incident reporting

Reported by: zooko
Owned by: somebody
Priority: major
Milestone: eventually
Component: code
Version: 1.4.1
Keywords: error logging usability foolscap
Cc:
Launchpad Bug:

Description

There are too many "incidents" being reported, making it harder for people to tell if there is anything really wrong. I think most of these are probably networking issues -- Tahoe is getting used in more and more places (e.g. #686) and the networks there don't behave the same as the networks in the allmydata.com grid. One reasonable policy would be "Nothing the network does is worth creating an incident report about."

Change History (2)

comment:1 Changed at 2009-04-26T20:51:39Z by warner

Yeah. Part of the reason for producing Incidents is to learn which exceptions are happening frequently so we can understand and downgrade them. If a specific kind of incident is happening a lot, then it needs to be either fixed or ignored (that is, the specific log message that triggers the event should be reduced in severity, below the threshold that triggers incident reporting).
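To make the threshold idea concrete, here is a minimal sketch in Python. The level names, numeric values, and the threshold are all hypothetical, chosen only for illustration; they are not Foolscap's actual constants or API. The point is only that lowering a message's severity below the threshold stops it from producing an incident.

```python
# Hypothetical severity levels for illustration only; these are NOT
# Foolscap's real constants.
LEVEL_ROUTINE = 20
LEVEL_ODD = 25
LEVEL_ALARMING = 30

# Hypothetical threshold: events at or above this level open an incident.
INCIDENT_THRESHOLD = LEVEL_ALARMING


def triggers_incident(level, threshold=INCIDENT_THRESHOLD):
    """Return True if an event logged at `level` should open an incident."""
    return level >= threshold


# A message logged at LEVEL_ALARMING produces an incident; after being
# downgraded to LEVEL_ODD it is still logged but no longer does.
before = triggers_incident(LEVEL_ALARMING)
after = triggers_incident(LEVEL_ODD)
```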

How many incidents are you seeing? And what is triggering them? The Foolscap package provides a CLI tool named "flogtool", and you can run "flogtool dump INCIDENTFILE" to read the contents of the incident file. grep for "TRIGGER" to see the specific event that triggered the Incident (the incident record includes events before and after the trigger).
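If you have many incident files, the dump-and-grep step can be scripted. The following is a rough sketch in Python, assuming "flogtool" is on your PATH; the helper names `dump_incident` and `trigger_lines` are made up for this example, and the only thing it assumes about the dump output is that the triggering event's line contains the text "TRIGGER".

```python
import subprocess


def dump_incident(path):
    """Run 'flogtool dump PATH' and return its standard output as text."""
    result = subprocess.run(
        ["flogtool", "dump", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if flogtool exits non-zero
    )
    return result.stdout


def trigger_lines(dump_text):
    """Return the lines of a dump that mention TRIGGER (same as grep)."""
    return [line for line in dump_text.splitlines() if "TRIGGER" in line]
```

Calling `trigger_lines(dump_incident(INCIDENTFILE))` for each file gives one list of triggering lines per incident, which is enough to see by eye which events recur.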

There is an incomplete set of tools for collecting and classifying Incidents, which includes code to label each incident with a "category" and sort them that way. The longer-term goal is to produce a web page which shows recent Incidents, and how many Incidents of each category have been produced, to make it more obvious which ones are false positives (and should be fixed by complaining less) and which ones are significantly unusual (and should be fixed by addressing the bug).
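As an illustration of the per-category counting, here is a small Python sketch. It is not the existing classification code: the function names, the substring-matching approach, and the example rules are all assumptions made up for this example.

```python
from collections import Counter


def categorize(trigger_text, rules):
    """Return the first category whose substring appears in the text.

    `rules` is a list of (category, substring) pairs; both the format
    and the contents are hypothetical.
    """
    for category, needle in rules:
        if needle in trigger_text:
            return category
    return "unknown"


def count_categories(trigger_texts, rules):
    """Count how many triggering events fall into each category."""
    return Counter(categorize(text, rules) for text in trigger_texts)


# Made-up rules: real categories would come from looking at real incidents.
EXAMPLE_RULES = [
    ("network", "connection lost"),
    ("corrupt-share", "corrupt"),
]
```

A category with a large count is a candidate for either a bug fix or a severity downgrade; a category that shows up once is more likely to be the significantly unusual kind worth a closer look.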

The overall goal is to highlight things that need attention. Getting a corrupt share from a server seemed to fall into this category: although we can handle it just fine, the odds of it happening are so low that somebody should look into it (either to tell the server operator that they're having disk problems, or to tell the server operator to please stop scribbling on your shares, or to run a memory tester on your own machine, or something).

But having a server go away during the download should certainly not trigger an Incident. That's the sort of log message which needs to be deprioritized.

comment:2 Changed at 2010-04-04T16:48:45Z by davidsarah

  • Keywords error logging usability foolscap added
  • Milestone changed from undecided to eventually