What should an organisation monitor in the cloud platform?

  • Network
  • Infrastructure
  • Application performance
  • End-user activity and experience
  • Change
  • Log
  • Data and DevOps
  • SLO and SLI
  • Error budget
  • Toil reduction
  • Data activity
  • Compliance
  • User activity and access control
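
To make the SLO and error budget items above more concrete, the sketch below derives an error budget from an availability target. The function names, the 99.9% target and the 30-day window are assumptions chosen for illustration; they are not values prescribed by this article.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining_minutes(slo_target: float, window_days: int,
                             downtime_minutes: float) -> float:
    """Budget left after subtracting observed downtime (negative means overspent)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# Assumed example: a 99.9% availability SLO over 30 days allows about 43.2 minutes.
budget = error_budget_minutes(0.999, 30)
```

A negative remaining budget would typically be a signal to slow down releases until reliability recovers.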

Network Monitoring

Monitoring solutions should collect data to monitor the following items:

  • Average Network Round Trip Time
  • Client Time
  • Network Latency
  • Total Network Incoming Traffic
  • Total Network Outgoing Traffic
  • Total Network Response Time
  • Total Server Time
  • Activity / Request Response
  • Availability
  • HTTP errors
  • Network read/write/io
  • Inbound data connection
  • Outbound data connection
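
As a rough illustration of how a round trip time figure might be collected, the sketch below times a TCP handshake and summarises a set of samples. Using the handshake as a proxy for round trip time is an assumption, and the function names are hypothetical; a real monitoring agent would usually provide these measurements itself.

```python
import socket
import time
from statistics import mean


def tcp_round_trip_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time a TCP handshake as a rough proxy for network round trip time."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the connection is closed again as soon as it is established
    return (time.perf_counter() - start) * 1000.0


def summarise_rtt(samples_ms):
    """Reduce raw round trip samples to the values a dashboard would chart."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    return {
        "average_ms": mean(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }
```

A collector could call `tcp_round_trip_ms` on a schedule and pass the recent samples to `summarise_rtt`.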

Infrastructure Monitoring

Monitoring solutions should collect data to monitor the following items:

  • Compute health
  • Storage health
  • Resource usage
  • Low Disk Space
  • Low Virtual Memory
  • Application and job run trend analysis
  • Background Process Crash and trend
  • Battery Wear
  • HD Failure
  • System Crash
  • Unexpected Shutdown
  • Connectivity disconnect
  • Minor and major update Failure
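
A low disk space check is one of the simpler items above to sketch. The example uses `shutil.disk_usage` from the Python standard library; the 10% free-space threshold and the function names are assumptions for illustration only.

```python
import shutil


def free_ratio(total_bytes: int, free_bytes: int) -> float:
    """Fraction of the volume that is still free."""
    if total_bytes <= 0:
        return 0.0  # treat an unreadable or empty volume as having no free space
    return free_bytes / total_bytes


def is_low_disk_space(path: str, min_free_ratio: float = 0.10) -> bool:
    """Return True when the volume holding `path` has less free space than allowed."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return free_ratio(usage.total, usage.free) < min_free_ratio
```

The same pattern (read a value, compare it with a threshold, raise an alert) applies to the low virtual memory item.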

Security Monitoring

  • Monitor access control system records and logs for all corporate systems, e.g. firewalls, routers, Intrusion Detection Systems (IDS)/Intrusion Prevention Systems (IPS), Database (DB) and web servers
  • Establish a Security Information and Event Management (SIEM) tool to detect, deter and filter recognised patterns of potential intrusions and incursions; the SIEM may also initiate designated response actions
  • Create an alert for all unexplained access or attempted access events, whether physical or logical
  • Activity logs of all attempts (successful or failed) to physically enter the corporate premises or logically access the IT infrastructure, including any network, system or applications (whether through direct or remote EPS access), will be subject to regular auditing and monitoring
  • Develop a central repository to which security logging data is transferred in a secure manner that maintains the integrity and confidentiality of the information
  • Monitor customer and third-party communications in accordance with applicable laws and regulations
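
The alerting item above could be approximated with a simple rule such as "flag any user with several failed access attempts in a short period". The sketch below is a minimal illustration of that idea; the event tuple layout, the threshold of three failures and the five-minute window are all assumptions, and a SIEM would normally provide this kind of correlation rule itself.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def users_to_alert(events, max_failures=3, window=timedelta(minutes=5)):
    """Return users with at least `max_failures` failed attempts inside `window`.

    `events` is an iterable of (timestamp, user_id, success) tuples.
    """
    recent_failures = defaultdict(deque)
    flagged = set()
    for timestamp, user_id, success in sorted(events, key=lambda e: e[0]):
        if success:
            continue
        failures = recent_failures[user_id]
        failures.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while timestamp - failures[0] > window:
            failures.popleft()
        if len(failures) >= max_failures:
            flagged.add(user_id)
    return flagged
```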

Audit Log

  1. User activities, exceptions and information security events must be audited periodically.
  2. The following data will be recorded where permissible:
  • User IDs
  • Dates, times and details of key events, e.g. log-on and log-off
  • Terminal identity or location
  • Records of successful and rejected system access attempts
  • Records of successful and rejected data and other resource access attempts
  • Changes to system configurations
  • Use of privileges
  • Use of system utilities and applications
  • File access and the kind of access
  • Alarms raised by access control systems
  • Activation or deactivation of protection systems, e.g. anti-virus and intrusion detection/prevention systems
  • Protocols used, if known
  • Port information, where available
  • System or application processes, if available
  3. For administrative tasks, the record will also show:
  • The time at which the event occurred
  • Whether the event was a success or failure
  • Information about the event
  • Which account and which Administrator undertook the task
  • Which process was invoked
  4. For any significant service event or outage, the system records:
  • A description of the event
  • When (date and time) the event occurred
  • Where the event occurred
  • The outcome (success or failure) of the event
  • The identity of the user/subject associated with the event
  5. Audit logs must be:
  • Protected against tampering and unauthorised access, and against being deactivated, modified or deleted
  • Encrypted at rest and in motion
  • Backed up and archived
  • Retained such that a minimum of three months' history can be provided for immediate analysis, with a further year's records held in the archive
  6. In the event of information leakage, audit logs should support:
  • Identifying the cause of the information leakage and the type and quantity of data 'At Risk'
  • Enabling the isolation of information systems that are leaking data
  • Identifying any other information or information systems that may have been contaminated subsequently
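
One way to picture an audit record that captures when, where, outcome and identity, and that is also resistant to silent modification, is a hash-chained list of records. The sketch below is an illustration under stated assumptions: hash chaining is one possible tamper-evidence technique rather than something this article prescribes, and the field and function names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def _digest(record: dict) -> str:
    """SHA-256 over every field except the record's own hash."""
    payload = {k: v for k, v in record.items() if k != "hash"}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_audit_record(chain, *, user_id, event, location, success, when=None):
    """Append a record linked to the previous one by its hash."""
    record = {
        "when": (when or datetime.now(timezone.utc)).isoformat(),
        "where": location,
        "user_id": user_id,
        "event": event,
        "outcome": "success" if success else "failure",
        "prev_hash": chain[-1]["hash"] if chain else GENESIS_HASH,
    }
    record["hash"] = _digest(record)
    chain.append(record)
    return record


def verify_chain(chain) -> bool:
    """Return False if any record was altered, removed or reordered."""
    prev_hash = GENESIS_HASH
    for record in chain:
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(record):
            return False
        prev_hash = record["hash"]
    return True
```

Because each record embeds the hash of its predecessor, changing an earlier entry invalidates every later one, which makes tampering detectable during review. Encryption, backup and retention would still need to be handled separately.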

Log Review Cycle

A process to review the logs should be established, following the rules described below.

  • The logs of systems holding Confidential data, providing connectivity to hosts or networks containing such data, or any system facing the Internet, must be reviewed daily.
  • The logs of other systems holding non-sensitive data will be reviewed weekly.
  • The logs of systems holding non-sensitive Public data should be reviewed fortnightly.
  • Only appropriately authorised persons within the company or specifically contracted third parties will review security log files.
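
The review frequencies above can be encoded as a small lookup so that a scheduler can work out when each system's logs are next due for review. This is a minimal sketch; the classification labels and the function name are assumptions introduced for the example.

```python
from datetime import date, timedelta

# Review intervals in days, taken from the cycle described above.
REVIEW_INTERVAL_DAYS = {
    "confidential": 1,   # daily
    "other": 7,          # weekly
    "public": 14,        # fortnightly
}


def next_review_due(classification: str, last_reviewed: date,
                    internet_facing: bool = False) -> date:
    """Return the date by which the system's logs should next be reviewed."""
    if internet_facing:
        days = 1  # Internet-facing systems are reviewed daily regardless of data class
    else:
        days = REVIEW_INTERVAL_DAYS[classification.lower()]
    return last_reviewed + timedelta(days=days)
```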

Event Monitoring

An event monitoring system should:

  • log all events, especially fault events
  • make sure that appropriate actions are taken to safeguard data and to accurately record the events and activities undertaken
  • record corrective actions to make sure that controls have not been compromised and that the action taken is authorised

Log Infrastructure Monitoring

Log management activities require the log reviewer to monitor the log output to make certain that:

  • Log rotation and archival processes are in place.
  • Each system's clock is synchronised to a common time source so that its timestamps match those generated by other systems.
  • The logging infrastructure can be reconfigured as needed to reflect changes in policy.
  • Anomalies in log settings, configurations and processes are detected and documented.
  • An alert is produced and disseminated in the event of an audit processing failure.
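
Log rotation and consistent timestamps can be illustrated with Python's standard `logging` module. The sketch below uses `logging.handlers.RotatingFileHandler` and formats timestamps in UTC so that entries from different systems line up; the size limit, backup count, logger name and function name are assumptions for the example, and archival of rotated files is not covered.

```python
import logging
import logging.handlers
import time


def build_rotating_logger(path: str, max_bytes: int = 1_000_000,
                          backups: int = 5) -> logging.Logger:
    """Create a logger that rotates its file by size and stamps entries in UTC."""
    handler = logging.handlers.RotatingFileHandler(
        path,
        maxBytes=max_bytes,    # rotate once the file reaches this size
        backupCount=backups,   # keep this many rotated files
        delay=True,            # do not open the file until the first record
    )
    formatter = logging.Formatter("%(asctime)sZ %(levelname)s %(name)s %(message)s")
    formatter.converter = time.gmtime  # render timestamps in UTC, not local time
    handler.setFormatter(formatter)

    logger = logging.getLogger(f"audit.{path}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        logger.addHandler(handler)
    return logger
```

UTC formatting only helps if the underlying clocks agree, so host-level time synchronisation is still required.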

Service Level Monitoring

For service level (SL) monitoring, the following information should be collected:

  • Number of open and overdue incidents
  • Number of open incidents that should be resolved in time
  • Percentage of open and overdue incidents
  • Number of incidents resolved in time
  • Number of incidents resolved that should have been fixed in time
  • The number of incidents that should have been resolved in time is measured daily as unit #.
  • Percentage of incidents resolved in time
  • Average Resolution Time In Hours For Resolved Incident SLA Tasks
  • Total resolved Incident SLA tasks
  • Number of incident assignments responded to in time
  • Number of incident assignments that should have been responded to in time
  • Number of open and overdue incident assignments
  • Number of open incident assignments that should be responded to in time
  • Percentage of incident assignments responded to in time
  • Percentage of open and overdue incident assignments
  • Summed duration time of resolved incident SLA tasks in hours
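
Several of the figures above can be derived from the same incident records. The sketch below computes a handful of them; the `Incident` fields and the choice of denominator for each percentage are assumptions, since the list above does not define them precisely.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    opened: datetime
    due: datetime                       # SLA deadline for resolution
    resolved: Optional[datetime] = None


def service_level_summary(incidents, now: datetime) -> dict:
    """Summarise resolution performance against the SLA deadline."""
    resolved = [i for i in incidents if i.resolved is not None]
    resolved_in_time = [i for i in resolved if i.resolved <= i.due]
    open_incidents = [i for i in incidents if i.resolved is None]
    open_overdue = [i for i in open_incidents if now > i.due]
    hours = [(i.resolved - i.opened).total_seconds() / 3600 for i in resolved]

    def pct(part, whole):
        return 100.0 * len(part) / len(whole) if whole else 0.0

    return {
        "resolved_in_time": len(resolved_in_time),
        "pct_resolved_in_time": pct(resolved_in_time, resolved),
        "open_and_overdue": len(open_overdue),
        "pct_open_and_overdue": pct(open_overdue, open_incidents),
        "total_resolution_hours": sum(hours),
        "avg_resolution_hours": sum(hours) / len(hours) if hours else 0.0,
    }
```

The incident assignment figures could be computed the same way by replacing the resolution deadline with a response deadline.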

End-User Experience Monitoring

End-user interactions in an application are simulated and executed periodically from different locations to calculate the availability and performance of an application. Business Process Monitoring addresses the process needed to monitor applications for availability and performance. There are two options available for monitoring: Transaction Monitor and URL Monitor.
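
A URL monitor of the kind described above can be sketched with the standard library alone. The example below requests a URL, records the status and elapsed time, and reports whether the endpoint counts as available. Treating any status below 400 as available is an assumption, as are the function name and the injectable `opener` parameter, which exists only so the check can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def check_url(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> dict:
    """Probe a URL once and report availability and response time."""
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code       # the server answered, but with a 4xx/5xx status
    except OSError:
        status = None           # DNS failure, refused connection, timeout, etc.
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "url": url,
        "status": status,
        "available": status is not None and status < 400,
        "elapsed_ms": elapsed_ms,
    }
```

Running the same check periodically from several locations gives the availability and performance series described above; a transaction monitor would chain several such steps to simulate a full user journey.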

Key Performance Indicator (KPI)

Defining the metrics, thresholds and KPIs for monitoring solutions is essential. Metrics can be defined and analysed, including usage data either on the infrastructure side (CPU and memory usage) or on the application side (for example, number of specific API calls, number of errors, etc.).

  • Notification and alerting can be added for any of the monitored metrics. The solution must then be adapted to ensure that the necessary raw data for the defined KPIs are collected. This can mean logging specific data, storing particular data in a database or adding specific data generation and aggregation into the implementation.
  • Metrics can be defined and monitored (for example, track the number of errors that occur in the application logs and send a notification whenever the rate of errors exceeds a specified threshold).
  • Health rules: A set of health rules should be defined to state the healthy and faulty conditions of the environment. Health rules are used to determine the KPI (key performance indicator) metric thresholds for your application across the stack.
  • Actions: Actions are used to specify what should be done in various situations, including sending email alerts and running custom scripts for performing diagnostic/remedial tasks.
  • Policies: Policies link your health rule violations and other performance-based events with appropriate actions to trigger.
  • Alert Suppression: A mechanism by which the triggering of alerts during known outages and planned downtimes could be avoided.
  • Instrumentation: The process of specifying what to monitor (mainly for Java and .NET runtimes) is known as instrumentation. For example, you can specify custom classes/methods in Java.
  • Distributed Tracing: While metrics may be sufficient for understanding overall performance aggregated across general dimensions, they aren't enough to understand a request's lifetime across multiple systems. Distributed tracing brings visibility to the lifetime of a request across several systems.
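
The relationship between health rules, policies, actions and alert suppression described above can be sketched in a few lines. This is an illustrative model only: the class names, the "greater than threshold" comparison and the use of plain numeric timestamps are assumptions, not the API of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class HealthRule:
    name: str
    metric: str
    threshold: float

    def violated(self, metrics: Dict[str, float]) -> bool:
        """A rule is violated when its metric exceeds the threshold."""
        return metrics.get(self.metric, 0.0) > self.threshold


@dataclass
class Policy:
    rule: HealthRule
    action: Callable[[str], None]   # e.g. send an email or run a script


def evaluate(policies: Iterable[Policy], metrics: Dict[str, float], now: float,
             suppression_windows: Iterable[Tuple[float, float]] = ()) -> List[str]:
    """Run actions for violated rules unless `now` is inside a suppression window."""
    suppressed = any(start <= now < end for start, end in suppression_windows)
    triggered = []
    for policy in policies:
        if policy.rule.violated(metrics) and not suppressed:
            policy.action(f"Health rule violated: {policy.rule.name}")
            triggered.append(policy.rule.name)
    return triggered
```

Passing a planned maintenance period as a suppression window prevents the action from firing during known downtime, while the rule itself stays unchanged.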

Dashboard

Dashboards should focus on resource utilisation and the four SRE golden signals (latency, traffic, errors and saturation). Here are some characteristics of an ideal dashboard:

  • A central overview is essential.
  • An SLI and SLA dashboard can improve performance monitoring by showing the following attributes alongside the unique identifier:
  • Service name
  • Method
  • API version
  • Credential ID
  • Location
  • Protocol (HTTP / gRPC)
  • HTTP Response Code (e.g. 402)
  • HTTP Response Code class (e.g. 4xx)
  • gRPC Status Code
  • System and region transaction count
  • System CPU use trend
  • Storage violation
  • Job CPU use trend
  • Region transaction count
  • Region count
  • Region CPU time
  • Region Storage violation
  • Transaction Count
  • Transaction CPU
  • Transaction Elapsed, Dispatch
  • Response time
  • Wait time
  • Transaction storage violation
  • Storage message
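
Two of the dashboard attributes above, the HTTP response code and its class, are related by a simple rule that is easy to show in code. The sketch below derives the class from the code and counts requests per service and class; the record layout (`service` and `status` keys) and the function names are assumptions for the example.

```python
from collections import Counter


def response_code_class(code: int) -> str:
    """Map an HTTP status code to its class, e.g. 402 -> '4xx'."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return f"{code // 100}xx"


def count_by_service_and_class(records) -> Counter:
    """Count requests per (service name, response code class) pair."""
    return Counter(
        (record["service"], response_code_class(record["status"]))
        for record in records
    )
```

Grouping by class keeps a dashboard panel readable, while the exact code remains available for drill-down.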
Hybrid Cloud Infrastructure and Operations Explained: Accelerate your application migration and modernization journey on the cloud with IBM and Red Hat
