October 2, 2025

The Importance of Performance, Reliability, and Scalability in Modern MES

Introduction: Why the bar just got higher

Manufacturing today operates at the speed of software, where shifts are faster, data is constant, and the cost of failure is higher than ever. Teams are coordinating across multiple sites, products turn over faster, and the amount of data flowing from people, parts, machines, and suppliers keeps multiplying. In that environment, a Manufacturing Execution System (MES) cannot simply keep the lights on. It must be relentlessly fast, available when needed, and ready to scale. Otherwise, it becomes a bottleneck that slows production and erodes trust.

The stakes are real. According to Oxford Economics, unplanned downtime costs the Global 2000 (the world’s largest companies) $400 billion annually. For manufacturing specifically, Aberdeen Research has found that unplanned downtime costs U.S. manufacturers an average of $260,000 per hour. In the automotive industry, where production lines are highly complex, a single hour of downtime can reach as high as $2.3 million. When production stops, so do cash flow and commitments. That’s why performance and reliability aren’t “nice to have” qualities for an MES; they’re financial and compliance imperatives.

Legacy MES deployments struggle here. Many were built for a slower era, not for real-time data, multi-site coordination, or today’s constant demand for integration. The result is slow dashboards, fragile integrations, and performance problems that surface at the worst possible times. When that happens, operators wait, supervisors lose visibility, and leadership loses confidence.

This post shares what we (and industry leaders) measure, the standards we hold ourselves to, and a practical checklist you can use to evaluate your current MES—no matter who your provider is.

What reliability really means (in plain English)

Reliability means your system is there when you need it, keeps data accurate, and recovers quickly when something breaks. For an MES, that translates to being up when the factory needs it, processing transactions accurately, and restoring service fast when something goes wrong. Availability (often stated as “uptime”) is one measure of reliability, but others include data integrity, error rates, and recovery times.

A quick reference for uptime “nines”:

  • 99.9% (“three nines”) ≈ 8.76 hours of potential unavailability per year

  • 99.99% (“four nines”) ≈ 52.6 minutes/year

  • 99.999% (“five nines”) ≈ 5.26 minutes/year

For a plant that runs two or three shifts, “just a few hours” of downtime can be the difference between hitting a shipment window and missing a quarter’s target.
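To make the arithmetic behind the “nines” concrete, here is a minimal Python sketch that converts an availability target into a downtime budget. The figures and formatting are illustrative, not a commitment:

```python
# Convert an availability target ("nines") into an allowed-downtime budget.
# Figures are illustrative; real SLOs should be defined per workload.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_budget(availability: float) -> dict:
    """Return the downtime allowed per year and per month for a target."""
    allowed_minutes = MINUTES_PER_YEAR * (1 - availability)
    return {
        "hours_per_year": allowed_minutes / 60,
        "minutes_per_month": allowed_minutes / 12,
    }

for target in (0.999, 0.9999, 0.99999):
    b = downtime_budget(target)
    print(f"{target * 100:g}%: {b['hours_per_year']:.2f} h/year, "
          f"{b['minutes_per_month']:.1f} min/month")
```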

Reliability isn’t only a software concept; it’s rooted in manufacturing standards and discipline:

  • ISA-95 clarifies the role of MES at Level 3 (manufacturing operations management) and its interfaces to Level 4 business systems like ERP. Clear boundaries and integrations reduce failure modes caused by “who owns what” ambiguity. The ISA-95 Standards for Enterprise–Control System Integration describe how to bridge enterprise planning with manufacturing execution.
  • ISO 9001 formalizes the quality management system (QMS) that sustains reliability, through corrective and preventive action (CAPA), audits, and continuous improvement. ISO 9001:2015 outlines how organizations can standardize processes, minimize errors, and build a foundation of continuous improvement across manufacturing operations.
  • IEC 62443 sets the global cybersecurity requirements for industrial automation and control systems (IACS). Stronger security postures directly reduce operational outages and protect data integrity, making cybersecurity inseparable from uptime. ISA/IEC 62443 defines best practices for securing IACS across industries and provides a framework for resilience throughout the system lifecycle.

In practice, reliability means consistent uptime, accurate data, secure operations, and quick recovery, all supported by disciplined processes across IT and OT. 

Performance and scalability, without the jargon

Performance is the speed and responsiveness operators and supervisors feel in the moment. It shows up in whether dashboards load instantly or stall under peak demand, and that difference is felt on every shipment and every quarter. Scalability is the ability to take on more products, users, sites, and data volume without slowing down or re-architecting the entire system.

A practical way to measure performance is by looking at percentiles, such as how fast the system responds 95% or 99% of the time. This is often written as P95 or P99. Averages can hide the bad moments, and if the slowest responses happen during peak hours, operators will feel it, compounding into missed shipments and wasted hours. Using percentiles within SLIs (Service Level Indicators) and setting SLOs (Service Level Objectives) is a practical way to capture these distributions.
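Percentiles are straightforward to compute from raw request timings. The sketch below is illustrative only (the sample latencies and the 300 ms threshold are made-up assumptions), but it shows how a tail metric like P95 can be derived and checked against an SLO:

```python
import statistics

# Made-up request latencies in milliseconds for illustration.
latencies_ms = [120, 95, 110, 180, 2300, 130, 140, 105, 400, 150,
                115, 98, 160, 175, 210, 135, 125, 102, 380, 145]

# The mean hides the bad moments; the tail is what operators feel.
mean_ms = statistics.mean(latencies_ms)

# quantiles(n=100) returns the 1st through 99th percentiles.
pct = statistics.quantiles(latencies_ms, n=100)
p95_ms, p99_ms = pct[94], pct[98]

print(f"mean={mean_ms:.0f} ms  P95={p95_ms:.0f} ms  P99={p99_ms:.0f} ms")

# Example SLI/SLO check: 95% of requests should finish within 300 ms.
SLO_P95_MS = 300
print("P95 SLO met" if p95_ms <= SLO_P95_MS else "P95 SLO violated")
```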

Human factors matter too. People generally perceive system response times in three key ranges: about 0.1 seconds feels “instant,” about 1 second preserves a sense of flow, and around 10 seconds starts to break attention. These limits are a useful guide for setting performance targets: critical reads should feel sub-second, while complex, multi-step operations may take longer but should always show progress and never block the line.
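As a rough illustration of how those perception thresholds can translate into target-setting, here is a tiny sketch; the labels and cutoffs simply restate the 0.1 s / 1 s / 10 s guideline above and are not a prescription:

```python
# Classify a response time against the classic perception thresholds.
# Purely illustrative; real targets should be set per workflow.
def perceived_responsiveness(seconds: float) -> str:
    if seconds <= 0.1:
        return "instant"
    if seconds <= 1.0:
        return "keeps flow"
    if seconds <= 10.0:
        return "noticeable delay: show progress"
    return "attention broken: make it asynchronous"

for t in (0.08, 0.6, 4.0, 15.0):
    print(f"{t:>5.2f} s -> {perceived_responsiveness(t)}")
```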

Where can legacy MES deployments bog down?

  • Blocking queries and N+1 patterns on hot paths

  • Unindexed reporting that competes with transactional workloads

  • Monolithic architectures that scale “up,” not “out”

  • Opaque integrations that amplify latency and failure across systems

Modern MES platforms avoid these traps with horizontal scaling, read replicas, short-TTL caching, back-pressure, and streaming/event architectures—paired with production-grade observability to catch issues before users do. A practical way to track system health is through the Four Golden Signals: response speed, traffic volume, error frequency, and resource strain. 
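As one illustrative way of keeping those signals in view, the sketch below summarizes latency, traffic, errors, and saturation for a window of requests. The data structures, field names, and numbers are assumptions made for the example, not any particular product’s telemetry:

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_ms: float
    ok: bool

def golden_signals(requests: list[Request], window_s: float, cpu_util: float) -> dict:
    """Summarize the Four Golden Signals for one observation window."""
    lat = sorted(r.latency_ms for r in requests)
    p99 = quantiles(lat, n=100)[98] if len(lat) >= 2 else (lat[0] if lat else 0.0)
    return {
        "latency_p99_ms": p99,                                    # latency
        "traffic_rps": len(requests) / window_s,                  # traffic
        "error_rate": sum(not r.ok for r in requests) / max(len(requests), 1),  # errors
        "saturation_cpu": cpu_util,                               # saturation
    }

# Made-up one-minute window:
window = [Request(120, True), Request(95, True), Request(2300, False), Request(140, True)]
print(golden_signals(window, window_s=60.0, cpu_util=0.72))
```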

An observer’s perspective: what we measure and how we hold ourselves accountable

The most credible MES providers share the same metrics they track internally. That way, customers can validate results with their own teams and bring clear data to leadership. Concretely, the categories that matter most for an MES are:

  • Uptime/Availability: measured monthly and quarterly, with clear SLOs and incident reviews.

  • Latency distributions (P95/P99) for APIs and key dashboards: measured under real traffic, including peak windows.

  • Throughput: events/second across ingest, workflow transitions, and traceability writes.

  • Resilience & Recovery: RTO (how fast we recover) and RPO (how much data we can afford to lose), plus the cadence and results of DR tests.

  • Error rates and saturation: the rest of the golden signals that predict user pain.

This isn’t theoretical. Cloud reliability frameworks provide a playbook: design for failure, define SLIs/SLOs, and test DR regularly—including multi-AZ/Region patterns when business demands require it.

Even if you don’t publish every datapoint, the habit of instrumenting, reviewing, and communicating these metrics builds trust with plant leadership and IT/OT alike. It also helps teams make sane trade-offs (for example, when to precompute vs. compute on demand, or when to invest in a hot standby).
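One habit that supports those reviews is tracking the error budget an availability SLO implies. The sketch below uses made-up numbers and a simplified 30-day month; it is meant only to show the shape of the calculation:

```python
# How much downtime does a monthly SLO allow, and how much is already spent?
# All figures are illustrative.

MINUTES_PER_MONTH = 30 * 24 * 60  # simplified 30-day month = 43,200 minutes

def error_budget_remaining(slo: float, downtime_minutes: float) -> float:
    """Fraction of the monthly error budget still unspent (negative = blown)."""
    budget_minutes = MINUTES_PER_MONTH * (1 - slo)
    return 1 - downtime_minutes / budget_minutes

# Example: a 99.9% monthly SLO allows about 43.2 minutes of downtime.
print(f"{error_budget_remaining(slo=0.999, downtime_minutes=12):.0%} of the budget left")
```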

Industry baselines you can use (and ask any vendor to show)

When evaluating a modern MES (or reviewing your own), it’s reasonable to anchor expectations to well-documented cloud and software standards:

  • Availability: As context, AWS EC2’s region-level SLA commits to 99.99% when instances are distributed across two or more Availability Zones. RDS Multi-AZ advertises ~99.95%, and Aurora Multi-AZ commits to 99.99%. These aren’t MES targets per se, but they establish a sensible ceiling for what the underlying infrastructure can support.

  • Latency: Google’s SRE framework encourages percentile-based targets—for example, P95 latency in the hundreds of milliseconds for critical reads, and P99 around or under ~1 second for interactive dashboards. These guardrails ensure systems feel responsive under real-world conditions.

  • The AWS Well-Architected Reliability Pillar and Disaster Recovery whitepapers outline approaches like backup/restore, pilot-light, warm-standby, and multi-site active/active. Each comes with different RTO/RPO and cost trade-offs. What matters is defining clear objectives for each workload and testing against them regularly.
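One lightweight way to make those objectives concrete is to record them per workload and check DR test results against them. The sketch below is hypothetical; the workload names, strategies, and numbers are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class RecoveryObjective:
    strategy: str        # backup/restore, pilot-light, warm-standby, active/active
    rto_minutes: float   # how quickly the workload must be restored
    rpo_minutes: float   # how much data loss is tolerable

# Hypothetical workloads and targets, for illustration only.
objectives = {
    "traceability_writes": RecoveryObjective("multi-site active/active", rto_minutes=5, rpo_minutes=0),
    "dashboards":          RecoveryObjective("warm-standby", rto_minutes=30, rpo_minutes=5),
    "historical_reports":  RecoveryObjective("backup/restore", rto_minutes=240, rpo_minutes=60),
}

def dr_test_passed(workload: str, measured_rto: float, measured_rpo: float) -> bool:
    """Compare a DR test result against the stated objective."""
    obj = objectives[workload]
    return measured_rto <= obj.rto_minutes and measured_rpo <= obj.rpo_minutes

print(dr_test_passed("dashboards", measured_rto=22, measured_rpo=3))  # True
```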

These references are not vendor claims. They are public benchmarks you can bring into internal conversations to verify that your MES is meeting or exceeding industry standards.

A manufacturer’s checklist to evaluate your MES

Use this list to cut through the noise in vendor evaluations. If you are already a customer, use it to benchmark your system and show progress to your teams.

  1. Uptime & Reliability

    Why it matters: Stops, delays, and degraded modes cost real money and erode trust.
    What good looks like: At least 99.9% availability for mission-critical operations; leaders aim toward 99.99%, backed by multi-AZ designs and clear incident reporting. Ask for a 12-month availability log and post-incident reviews.

  2. Speed at Scale (Tail Latency)

    Why it matters: Operators and supervisors experience the worst 1–5% of interactions at the most stressful moments.
    What good looks like: Publish P95/P99 latency for key APIs and dashboards under peak load, not demo data. Aim for a sub-200–300 ms P95 for hot reads and ≤1 s P99 for interactive dashboards; long operations must display progress and never block the line.

  3. Throughput (Events per Second)

    Why it matters: As products, stations, and teams scale, write volume and fan-out climb.
    What good looks like: Clear ingest and workflow EPS targets with no SLO regressions at 2× and 5× today’s volume. Ask for load-test artifacts (traffic patterns, data sizes, and results).

  4. Real-Time Visibility

    Why it matters: Supervisors need live, trustworthy signals to prevent stoppages and defects.
    What good looks like: Dashboards remain responsive during batch jobs and spikes; architecture uses indexing, caching, back-pressure, and read-optimized paths. Map metrics to the Four Golden Signals: latency, traffic, errors, and saturation.

  5. Disaster Recovery & Failover

    Why it matters: Disasters are rare; recoveries should be rehearsed.
    What good looks like: Documented RTO/RPO per workload, regular DR tests with outcomes, and architectures appropriate to the business (backup/restore → pilot-light → warm-standby → multi-site active/active).

  6. Security That Protects Uptime

    Why it matters: Security incidents are uptime incidents.
    What good looks like: Alignment to ISA/IEC 62443 for industrial automation and control systems (IACS), real incident response runbooks, and segmented access that limits blast radius.

  7. Standards-Aligned Architecture

    Why it matters: Clean Level 3–4 boundaries prevent finger-pointing and fragile integrations.
    What good looks like: Clear adherence to ISA-95 roles and interfaces, with tested connectors to ERP/PLM/quality systems.

  8. Quality System Discipline

    Why it matters: Sustained reliability requires feedback loops, not heroics.
    What good looks like: ISO 9001 practices for CAPA, audits, and continuous improvement reflected in MES change management.

  9. Transparency & Reporting

    Why it matters: Trust compounds when you can see the numbers.
    What good looks like: Vendors publish SLIs/SLOs (uptime, P95/P99, throughput) and share post-incident analyses.

  10. Proof Under Load

    Why it matters: Demos are not production.
    What good looks like: Vendor supplies load-test plans and results using your scale assumptions and data shapes; a minimal example of this kind of check is sketched after the checklist.
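To illustrate what “proof under load” might look like, here is a minimal sketch that evaluates one load-test run against the latency and error targets named above. The result shape, thresholds, and sample data are assumptions, not any vendor’s actual test harness:

```python
from statistics import quantiles

# Targets drawn from the checklist above (illustrative).
P95_TARGET_MS = 300
P99_TARGET_MS = 1000
MAX_ERROR_RATE = 0.001

def evaluate_load_test(latencies_ms: list[float], errors: int) -> dict:
    """Judge one load-test run against the latency and error targets."""
    pct = quantiles(latencies_ms, n=100)
    p95, p99 = pct[94], pct[98]
    error_rate = errors / len(latencies_ms)
    return {
        "p95_ms": round(p95, 1),
        "p99_ms": round(p99, 1),
        "error_rate": error_rate,
        "passed": p95 <= P95_TARGET_MS and p99 <= P99_TARGET_MS
                  and error_rate <= MAX_ERROR_RATE,
    }

# Made-up results from a 2x-volume run: mostly fast, a slow tail, one error.
sample = [110 + i % 150 for i in range(980)] + [900] * 20
print(evaluate_load_test(sample, errors=1))
```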

How First Resonance approaches this (lightly, by example)

From the beginning, we’ve favored an “observer perspective”: publish what we measure and invite customers to hold us to it. In practice, we treat uptime, P95/P99 latency, throughput, and recovery objectives as first-class SLIs, and we review them with customers on a regular cadence.

Architecturally, we align with well-understood cloud reliability patterns—multi-AZ databases, short-TTL caching with smart invalidation, streaming for hot paths, and read-optimized APIs—so the platform stays responsive at peak. We map our observability to the Four Golden Signals and maintain a disaster recovery plan that is tested, documented, and adjusted as customers scale. None of this is unique to us; it reflects what modern MES leadership requires.

Conclusion: Accountability is the new differentiator

Performance, reliability, and scalability are not abstract engineering ideals. They are the difference between confident, first-time-right builds and a day of triage. With downtime costing industries billions per year and single hours reaching seven figures in some sectors, the bar for MES is rising fast. Buyers should use the checklist above to assess new systems. Customers should use it to validate results and reinforce wins inside their own organizations. Either way, expect vendors (including us) to present real numbers, not just claims.

If you’d like to see the standards we hold ourselves to—and how we test against them in environments like yours—reach out. We’re happy to share details and compare notes.