RFC - Design Doc - Machine Downtime Recording

Machine Downtime Recording

Nominated Owners: @andy.german @geoff


In a manufacturing facility, time is money. Lost time is expensive, so understanding where time is lost and finding ways to reduce it is a high priority for every manufacturing company.
A core function of most MES applications is tracking lost time and the reasons for it. Several standards apply to part or all of the problem. PackML is a standard for defining machine states, ISO 22400 is a standard that defines the KPI calculations that can be performed on the recorded data, and the ISA-95 capability model covers ways to think about available capacity, committed capacity, consumed capacity and wasted capacity.

The requirements from most manufacturing companies include the following:

Recording machine state from the machine

  • If my machine stops, the MES will create a record of when it stopped and the duration of the stoppage.
  • If my machine knows why it stopped, the MES will accept and record this reason for the stoppage.
  • If my machine runs more slowly than expected, the MES will capture the time lost to slow running. There are two forms of this: cycle time calculations for discrete process steps, and monitoring the current speed against a target speed for continuous processes.
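
As a rough illustration of the requirements above, the sketch below derives stoppage records from a stream of machine state events and estimates the time lost to slow running on a continuous process. The names StateEvent, DelayRecord, build_delay_records and speed_loss are hypothetical, and the speed-loss formula (run time multiplied by the fractional shortfall against target speed) is a simplifying assumption, not a definition taken from a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class StateEvent:
    """Hypothetical machine state event."""
    timestamp: datetime
    running: bool
    reason: Optional[str] = None  # reason reported by the machine, if it knows one


@dataclass
class DelayRecord:
    """Hypothetical closed stoppage record."""
    start: datetime
    end: datetime
    reason: Optional[str]

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def build_delay_records(events: list[StateEvent]) -> list[DelayRecord]:
    """Pair each stop event with the next run event to form a delay record.

    A stoppage that has not ended yet is left open and is not returned.
    """
    records: list[DelayRecord] = []
    open_stop: Optional[StateEvent] = None
    for event in sorted(events, key=lambda e: e.timestamp):
        if not event.running and open_stop is None:
            open_stop = event  # machine has just stopped
        elif event.running and open_stop is not None:
            records.append(
                DelayRecord(open_stop.timestamp, event.timestamp, open_stop.reason)
            )
            open_stop = None
    return records


def speed_loss(run_time: timedelta, actual_speed: float, target_speed: float) -> timedelta:
    """Assumed formula: time lost = run time x fractional shortfall against target."""
    if target_speed <= 0:
        raise ValueError("target_speed must be positive")
    shortfall = max(0.0, 1.0 - actual_speed / target_speed)
    return run_time * shortfall
```

For example, an hour of running at half the target speed would be counted as 30 minutes of lost time under this assumption. Cycle time losses for discrete steps would need a separate calculation that is not shown here.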

Adding context to the delay record to improve data quality

  • The MES will understand the current shift and crew working on the machine, and will record this context with the delay record.
  • The MES will understand if the machine is scheduled to run, and if not, will automatically classify the delay record.
  • The machine may have been stopped for multiple reasons (for example, a product change that runs over into a lunch break), so the MES user needs to be able to split delay records and classify each part separately.
  • Users may split records incorrectly, so the MES needs to allow split records to be merged back together.
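
The split and merge operations described above can be sketched as follows. This is a minimal illustration, not a proposed design: the DelayRecord fields, and the choices that both halves of a split inherit the original context and that a merge keeps the first record's reason, are assumptions.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class DelayRecord:
    """Hypothetical delay record carrying shift and crew context."""
    start: datetime
    end: datetime
    shift: str
    crew: str
    reason: Optional[str] = None


def split_record(record: DelayRecord, at: datetime) -> tuple[DelayRecord, DelayRecord]:
    """Split one record into two contiguous records at the given time.

    Both halves inherit the original context; the user then reclassifies them.
    """
    if not (record.start < at < record.end):
        raise ValueError("split point must fall strictly inside the record")
    return replace(record, end=at), replace(record, start=at)


def merge_records(first: DelayRecord, second: DelayRecord) -> DelayRecord:
    """Merge two contiguous records, keeping the first record's context and reason."""
    if first.end != second.start:
        raise ValueError("only contiguous records can be merged")
    return replace(first, end=second.end)
```

A product change that runs into a lunch break could then be split at the start of the break, with the second half reclassified by the user. Merging the two halves again restores the original time span.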

Maintaining Reason / Cause lists

  • Many companies aim to have a standard list of delay reasons so that reports are comparable across sites, but continuous improvement at a site means focusing on resolving issues. This can mean that a site or machine team will want to create more detailed reasons to track progress against specific issues.
  • Reasons can then form a hierarchy, with enterprise wide standard reasons at the top levels of the hierarchy, and site or machine type specific reasons optionally being added at the lower levels of the hierarchy.
  • Some downtime tracking systems create a relationship between reasons and equipment classes, because not all reasons are relevant to all equipment classes. For example, a “Conveyor Jam” would not make sense on equipment that has no conveyor.
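
One possible shape for such a reason hierarchy is sketched below. The Reason fields, the scope values and the rule that an empty equipment class set means "applies everywhere" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reason:
    """Hypothetical node in the reason tree."""
    code: str
    parent: Optional[str] = None          # code of the parent reason, None at the root
    scope: str = "enterprise"             # assumed values: "enterprise", "site", "machine"
    equipment_classes: frozenset[str] = frozenset()  # empty set = all equipment classes


def reason_path(reasons: dict[str, Reason], code: str) -> list[str]:
    """Return the codes from the top-level reason down to the given reason."""
    path: list[str] = []
    current: Optional[str] = code
    while current is not None:
        path.append(current)
        current = reasons[current].parent
    return list(reversed(path))


def reasons_for(reasons: dict[str, Reason], equipment_class: str) -> list[str]:
    """Return the codes of reasons that are relevant to an equipment class."""
    return [
        r.code
        for r in reasons.values()
        if not r.equipment_classes or equipment_class in r.equipment_classes
    ]


# Example tree: an enterprise-wide reason with a site-specific child.
REASONS = {
    r.code: r
    for r in [
        Reason("Mechanical"),
        Reason("Conveyor Jam", parent="Mechanical", scope="site",
               equipment_classes=frozenset({"Packer"})),
        Reason("No Material"),
    ]
}
```

Because each site-level reason points at an enterprise-level parent, reports can roll site-specific detail up to the standard reasons for cross-site comparison.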

Workflow based on the delay

  • If a delay exceeds a predetermined threshold, escalate the delay to supervisors.
  • If the machine operator requires assistance, request the assistance from the appropriate department (ANDON)
  • If the delay has known work that is required to be performed (like a product change), that work has an allowable duration, and the delay exceeds that duration, automatically split the delay at the end of the allowable duration so that the user is required to explain why the work took longer than allowed.
  • If the delay has known work that is required to be performed, and that work is defined with actions that need to be taken, like a pre-start check-list, allow the user to complete the checklist.
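
The escalation and automatic split rules above can be expressed as two small checks. This is a sketch under assumptions: the function names are hypothetical, and thresholds and allowable durations are passed in directly, with no attempt to show where they would be configured.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_escalation(start: datetime, now: datetime, threshold: timedelta) -> bool:
    """True once an ongoing delay has lasted longer than the escalation threshold."""
    return now - start > threshold


def auto_split_point(start: datetime, end: datetime, allowable: timedelta) -> Optional[datetime]:
    """Return the time at which an overrunning delay should be split.

    Returns None when the work finished within its allowable duration.
    """
    limit = start + allowable
    return limit if end > limit else None
```

With a 20 minute allowable duration, a product change from 10:00 to 10:35 would be split at 10:20, leaving a 15 minute record for the user to explain.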

Workflow for continuous improvement

  • In a daily meeting, review the day's delays and record any actions that need to be performed, with the assigned person and due date.
  • Review the open actions.
  • Identify recurring issues through Pareto analysis of the frequency of problems. Review the actions being taken to address recurring issues.
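
A frequency-based Pareto analysis of this kind can be sketched in a few lines. The pareto function name and the plain list of reason strings as input are assumptions, and a real report might rank by total duration instead of count.

```python
from collections import Counter


def pareto(reasons: list[str]) -> list[tuple[str, int, float]]:
    """Rank reasons by frequency and attach the cumulative share of all delays.

    Each row is (reason, count, cumulative fraction of the total count).
    """
    counts = Counter(reasons)
    total = sum(counts.values())
    rows: list[tuple[str, int, float]] = []
    cumulative = 0
    for reason, count in counts.most_common():
        cumulative += count
        rows.append((reason, count, cumulative / total))
    return rows
```

Reasons at the top of the list, which together account for most of the cumulative share, are the natural candidates for improvement actions.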

Calculate KPIs based on the recorded information

ISO 22400 provides detailed descriptions of many standard KPIs for machine performance.
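
As an example of a KPI derived from delay records, the sketch below computes a simplified availability ratio: the share of planned time that was not lost to recorded delays. It is an illustration only. The function name and inputs are assumptions, and the exact time elements and formulas should be taken from ISO 22400 itself.

```python
from datetime import timedelta


def availability(planned_time: timedelta, delays: list[timedelta]) -> float:
    """Simplified availability: (planned time - recorded downtime) / planned time."""
    if planned_time <= timedelta(0):
        raise ValueError("planned_time must be positive")
    downtime = sum(delays, timedelta(0))
    # Clamp at zero in case recorded downtime exceeds the planned time.
    return max(0.0, (planned_time - downtime) / planned_time)
```

For an 8 hour planned period with delays of 30 and 18 minutes, this ratio comes out at about 0.9.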



Downtime recording is a core MES/DataHub feature that is expected by almost all customers.

Guide-level explanation

Reference-level explanation


Rationale and alternatives

Prior art

We have several different implementations of Downtime Recording across

This is one: https://www.notion.so/Time-series-OEE-Data-Model-746391c153844699bcbf84a0817decfe

There is another that uses the State Models and records state transitions in the graph database.

Unresolved questions

  • Should state transitions be stored as time-series or in the graph database?
  • How should we manage the extensibility of reasons in the reason tree?

Future possibilities