
Use OpenTelemetry to Auto Instrument WordPress

2022 November 01 - 1409 words - 8 mins - opentelemetry open source blog

OpenTelemetry is often associated with modern microservices and cloud-native applications. What happens if we apply OpenTelemetry and modern observability techniques to something completely different? WordPress is the world's most popular weblog software, and it is also a nearly 20-year-old monolith. What can we learn by pointing OpenTelemetry auto instrumentation and Splunk Observability Cloud at it?

Imagine you are responsible for running WordPress sites. What insights can modern tools give us into this popular monolith, just by instrumenting the environment and without any changes to the WordPress code?

This is exactly what I will show you. Please follow along!

Setup

We run WordPress and the accompanying database in Docker. We will not modify the WordPress code in any way; instead, we add auto instrumentation to the PHP runtime as a module. We send the resulting metrics and traces to Splunk Observability Cloud.

If you want to dive in directly, all code shown in this blog can be found in my WordPress Auto Instrument repo.

Add auto instrumentation to stock WordPress container image

We use the open-source PHP tracing module to do the tracing. We install it by extending the stock WordPress image with our own Dockerfile, which adds the tracing extension to the PHP runtime. A sketch of this is shown below.
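
The actual Dockerfile is in the repo linked above; a minimal sketch could look like the following. The package file name is an assumption: at the time of writing, the module was distributed as the SignalFx Tracing Library for PHP, packaged as a .deb that registers itself with the PHP runtime on installation.

FROM wordpress:latest

# Copy the pre-downloaded PHP tracing package into the image
# (the file name is an assumption; the Makefile discussed later downloads it)
COPY signalfx-php-tracing.deb /tmp/signalfx-php-tracing.deb

# Install the tracing extension so the PHP runtime loads it, then clean up
RUN dpkg -i /tmp/signalfx-php-tracing.deb \
    && rm /tmp/signalfx-php-tracing.deb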

Configure local environment

We add (sensitive) environment variables to a file called .env. It's best to add this file to your .gitignore and .dockerignore.

The SPLUNK_ACCESS_TOKEN and SPLUNK_REALM are specific to your Splunk Observability Cloud account. Get a free trial to try this out for yourself.

MYSQL_ROOT_PASSWORD=your_root_password
MYSQL_USER=your_wordpress_database_user
MYSQL_PASSWORD=your_wordpress_database_password

SIGNALFX_SERVICE_NAME=wordpress
SIGNALFX_ENDPOINT_URL=http://splunk-otel-collector:9411/api/v2/traces

SPLUNK_MEMORY_TOTAL_MIB=1024
SPLUNK_ACCESS_TOKEN=<place your token here>
SPLUNK_REALM=us0

Create docker compose file

We base our docker compose file on the sample provided by Docker.

For old times' sake, we use a MySQL image instead of the default MariaDB. We don't need to expose the database ports; the containers communicate over a dedicated network.

version: '3'

services:
  db:
    image: mysql:8.0
    container_name: db
    restart: always
    env_file: .env
    environment:
      - MYSQL_DATABASE=wordpress
    volumes:
      - dbdata:/var/lib/mysql
    command: '--default-authentication-plugin=mysql_native_password'
    networks:
      - app-network

For WordPress we add a build configuration, so our own Dockerfile is used to build the container.

This adds our auto instrumentation to the stock WordPress container.

  wordpress:
    depends_on:
      - db
    image: traced-wordpress:latest
    build: ./traced-wordpress
    container_name: wordpress
    restart: always
    env_file: .env
    environment:
      - WORDPRESS_DB_HOST=db:3306
      - WORDPRESS_DB_USER=$MYSQL_USER
      - WORDPRESS_DB_PASSWORD=$MYSQL_PASSWORD
      - WORDPRESS_DB_NAME=wordpress
    volumes:
      - wordpress:/var/www/html
    networks:
      - app-network
    ports:
      - "8080:80"

To collect metrics and traces we also add the Splunk OpenTelemetry collector container.

We provide the configuration for the collector via a bind-mounted file: collector.yaml. We will discuss the contents of this file later on.

  splunk-otel-collector:
    image: quay.io/signalfx/splunk-otel-collector:latest
    env_file: .env
    environment:
      - SPLUNK_ACCESS_TOKEN=${SPLUNK_ACCESS_TOKEN}
      - SPLUNK_REALM=${SPLUNK_REALM}
      - SPLUNK_CONFIG=/etc/collector.yaml
    volumes:
      - ./collector.yaml:/etc/collector.yaml
    ports:
      - "13133:13133"
      - "14250:14250"
      - "14268:14268"
      - "4317:4317"
      - "6060:6060"
      - "8888:8888"
      - "9080:9080"
      - "9411:9411"
      - "9943:9943"
      - "23456"
    networks:
      - app-network

To make our WordPress configuration, extensions, and database persistent, we need to add volumes. And to enable communication between the containers, we add a network:

volumes:
  wordpress:
  dbdata:

networks:
  app-network:

Splunk OpenTelemetry Collector Configuration

We start with a minimal configuration based on the default config of the Splunk OpenTelemetry Collector.

extensions:
  smartagent:
    bundleDir: ${SPLUNK_BUNDLE_DIR}
    collectd:
      configDir: ${SPLUNK_COLLECTD_DIR}
  memory_ballast:
    size_in_percentage: 33
receivers:
  zipkin:
    endpoint: 0.0.0.0:9411
  prometheus/internal:
    config:
      scrape_configs:
      - job_name: otel-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - 0.0.0.0:8888
        metric_relabel_configs:
        - source_labels:
          - __name__
          regex: .*grpc_io.*
          action: drop
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      disk:
      filesystem:
      memory:
      network:
      load:
      paging:
      processes:
processors:
  batch:
  memory_limiter:
    check_interval: 2s
    limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
  resourcedetection:
    detectors:
    - ec2
    - gce
    - azure
    - ecs
    - system
    override: false
  resourcedetection/internal:
    detectors:
    - ec2
    - gce
    - azure
    - ecs
    - system
    override: true
  resource/add_environment:
    attributes:
    - action: insert
      value: production
      key: deployment.environment
exporters:
  sapm:
    access_token: ${SPLUNK_ACCESS_TOKEN}
    endpoint: ${SPLUNK_TRACE_URL}
  signalfx:
    access_token: ${SPLUNK_ACCESS_TOKEN}
    api_url: ${SPLUNK_API_URL}
    ingest_url: ${SPLUNK_INGEST_URL}
    sync_host_metadata: true
    correlation:
service:
  extensions:
  - memory_ballast
  pipelines:
    metrics:
      receivers:
      - hostmetrics
      processors:
      - memory_limiter
      - batch
      - resourcedetection
      exporters:
      - signalfx
    metrics/internal:
      receivers:
      - prometheus/internal
      processors:
      - memory_limiter
      - batch
      - resourcedetection/internal
      exporters:
      - signalfx
    traces:
      receivers:
      - zipkin
      processors:
      - memory_limiter
      - batch
      - resourcedetection
      - resource/add_environment
      exporters:
      - sapm
      - signalfx

To be able to identify this installation, we update the deployment.environment attribute so it uniquely identifies this WordPress installation. In our case we replace production with wordpress by updating the value in this processor:

  resource/add_environment:
    attributes:
      - action: insert
        value: wordpress
        key: deployment.environment

First start

To prepare our environment we run make. This downloads the PHP auto tracing library and builds our custom WordPress container, roughly as sketched below.
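
Under the hood, the make target boils down to two steps. The release URL and file name below are placeholders; check the Makefile in the repo for the real ones:

% curl -sSL -o traced-wordpress/signalfx-php-tracing.deb \
    https://github.com/signalfx/signalfx-php-tracing/releases/download/<version>/<package>.deb
% docker-compose build wordpress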

We can start our environment:

% docker-compose up

When we open our browser at http://localhost:8080/ we are greeted with the WordPress installer.

[Screenshot: the WordPress installer]

We follow the installer and add some default configuration. When done, we log in using the credentials created in the previous step. We don't like the default URL scheme, so we change it to Post name. Navigate to:

Administration -> Settings -> Permalinks

[Screenshot: the WordPress settings screen]

While we browse around the WordPress site, we are already sending traces. Check in Splunk APM to see what we've already got:

[Screenshot: the service map in Splunk APM]

And we are also getting metrics from the host running our containers:

[Screenshot: a Splunk Infrastructure Monitoring dashboard]

Refine our collector config

Traces

Break down by URL

We already get quite some insight into our monolith. But wouldn't it be nice if we could break down performance by URL? We already have that data: the visited URLs are added to the traces as the http.url attribute.

We could configure Splunk APM to add a breakdown for this attribute; see https://docs.splunk.com/observability/apm/span-tags/cmms.html.

In this case, I don't want to change anything in the APM configuration, so I will solve it in the collector configuration. The OpenTelemetry collector has very powerful features to manipulate data. Here, we change the span name to the contents of the http.url attribute. We add the following processor:

  span/endpoint:
    name:
      from_attributes: [http.url]

And add the processor to the traces service:

    traces:
      receivers:
        - zipkin
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - resource/add_environment
        - span/endpoint
      exporters:
        - sapm
        - signalfx

Now we can inspect our performance by URL:

[Screenshot: the tag explorer in Splunk APM]

Break down by database operation or table

In the trace view you might already have seen that we collect all SQL queries. Wouldn't it be great if we could also break down our performance by database table or operation?

To do this we again use the processing capabilities of the OpenTelemetry collector. This time it is a bit more complicated, as we need to extract the operation and the database table from the query and put them into new attributes.

We extract the database operation using this regex: (?ms)(?P<dboperation>\w+). This is a very naive regex that simply grabs the first word of the query, which works surprisingly well.

Extracting the tables is a bit trickier when JOINs are involved. For the sake of simplicity I only extract the base table and exclude any tables joined to it for now. The regex to grab the word after FROM is (?ms).*FROM\s`?(?P<dbtable>[^\s`]+).
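
To illustrate, running these two regexes over a hypothetical query such as

SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'

would capture SELECT into dboperation and wp_options into dbtable.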

These regexes use named capturing groups, dboperation and dbtable, to extract the data. The attributes we need are called db.operation and db.sql.table. Unfortunately, a dot is not allowed in the name of a capturing group, so we need to rename the extracted attributes.

Renaming attributes in the OpenTelemetry collector is done in two steps: duplicate the attribute under the new name, then delete the original attribute.

This brings us to the following configuration:

  attributes/db:
    include:
      match_type: strict
      attributes:
        - key: component
          value: mysqli
    actions:
      - key: resource.name
        pattern: (?ms)(?P<dboperation>\w+)
        action: extract
      - key: db.operation
        from_attribute: dboperation
        action: insert
      - key: dboperation
        action: delete
      - key: resource.name
        pattern: (?ms).*FROM\s`?(?P<dbtable>[^\s`]+)
        action: extract
      - key: db.sql.table
        from_attribute: dbtable
        action: insert
      - key: dbtable
        action: delete

You see the extract, insert, and delete actions twice: once for the operation and once for the table. At the top, the include block restricts these transformations to spans that have the attribute component set to mysqli. This way we are sure we only manipulate database-related spans.

We also need to add this processor to the traces service:

    traces:
      receivers:
        - zipkin
      processors:
        - memory_limiter
        - batch
        - resourcedetection
        - resource/add_environment
        - attributes/db
        - span/endpoint
      exporters:
        - sapm
        - signalfx
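
With collector.yaml updated, one way to pick up the changes is to restart just the collector container, using the service name from our compose file:

% docker-compose restart splunk-otel-collector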

After restarting the OpenTelemetry collector, we now have nice database breakdowns available:

[Screenshot: the tag explorer in Splunk APM]
[Screenshot: database performance in Splunk APM]

MySQL metrics

Now that we have quite some insight into what WordPress and the database are doing, wouldn't it be nice to see some more details of our database?

We can easily get that by adding a receiver for MySQL to our collector config:

  smartagent/mysql:
    type: collectd/mysql
    host: db
    port: 3306
    databases:
      - name: wordpress
        username: ${MYSQL_USER}
        password: ${MYSQL_PASSWORD}
    extraDimensions:
      service.name: mysql-service

And add this receiver to our metrics pipeline service:

  pipelines:
    metrics:
      receivers:
        - hostmetrics
        - smartagent/mysql

This makes the default MySQL dashboard light up:

[Screenshot: the default MySQL dashboard in Splunk Infrastructure Monitoring]

Final result

Now we have fully instrumented our WordPress monolith with modern tooling, without any changes to the WordPress application itself!

Amongst other things, we can see latency per URL, inspect slow queries, see which URLs map to which queries, and inspect our most important database and host metrics. Pretty impressive if you ask me!

Further improvements

All techniques mentioned in this article are applied without any changes to the WordPress or template code. If needed, we could of course go further. We could add Real User Monitoring to the front end to get deep insight into how the pages render on client devices. This could point us to performance problems and bugs in the JavaScript and CSS used by the WordPress templates or posts.

To further instrument the backend, we could add manual tracing statements to our WordPress plugins. This would show us in great detail how much each plugin contributes to the total backend loading time.

This article was originally written for and published on the Splunk Community Blog.