Monitor simulation
==================
Flower allows you to monitor system resources while running your simulation. Moreover,
the Flower simulation engine is powerful and enables you to decide how to allocate
resources per client and constrain the total usage. Insights from resource
consumption can help you make smarter decisions and speed up the execution time.

The specific instructions assume you are using macOS and have the `Homebrew
<https://brew.sh/>`_ package manager installed.
Downloads
---------
.. code-block:: bash
brew install prometheus grafana
`Prometheus <https://prometheus.io/>`_ is used for data collection, while `Grafana
<https://grafana.com/>`_ will enable you to visualize the collected data. They are both
well integrated with `Ray <https://docs.ray.io/en/latest/>`_, which Flower uses under the hood.
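To confirm that both packages were installed correctly, you can ask Homebrew to list
them (a quick sanity check; the exact versions will differ on your machine):

.. code-block:: bash

    # List the installed versions of both packages
    brew list --versions prometheus grafana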
Overwrite the configuration files (depending on your device, they might be installed at a
different path).

If you are on an M1 Mac, it should be:
.. code-block:: bash
/opt/homebrew/etc/prometheus.yml
/opt/homebrew/etc/grafana/grafana.ini
On the previous generation Intel Mac devices, it should be:
.. code-block:: bash
/usr/local/etc/prometheus.yml
/usr/local/etc/grafana/grafana.ini
Open the respective configuration files and change them. Depending on your device, use
one of the two following commands:
.. code-block:: bash
# M1 macOS
open /opt/homebrew/etc/prometheus.yml
# Intel macOS
open /usr/local/etc/prometheus.yml
and then delete all the text in the file and paste the new Prometheus config shown
below. You may adjust the time intervals to your requirements:
.. code-block:: yaml

    global:
      scrape_interval: 1s
      evaluation_interval: 1s

    scrape_configs:
      # Scrape from each ray node as defined in the service_discovery.json provided by ray.
      - job_name: 'ray'
        file_sd_configs:
          - files:
              - '/tmp/ray/prom_metrics_service_discovery.json'
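Optionally, you can verify that the file parses correctly with ``promtool``, which ships
with Prometheus, assuming it is on your ``PATH`` (adjust the path on Intel Macs):

.. code-block:: bash

    # Validate the Prometheus configuration file
    promtool check config /opt/homebrew/etc/prometheus.yml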
Now after you have edited the Prometheus configuration, do the same with the Grafana
configuration file. Open it using one of the following commands as before:
.. code-block:: bash

    # M1 macOS
    open /opt/homebrew/etc/grafana/grafana.ini
    # Intel macOS
    open /usr/local/etc/grafana/grafana.ini
Your terminal editor should open, allowing you to apply the following configuration as
before:
.. code-block:: ini

    [security]
    allow_embedding = true

    [auth.anonymous]
    enabled = true
    org_name = Main Org.
    org_role = Viewer

    [paths]
    provisioning = /tmp/ray/session_latest/metrics/grafana/provisioning
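Note that the ``provisioning`` path above is created by Ray itself, so it only exists
while a Ray session is active. Once a simulation has started (see below), you can
confirm the directory is there:

.. code-block:: bash

    # Created by Ray when a session starts
    ls /tmp/ray/session_latest/metrics/grafana/provisioning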
Congratulations, you have just installed and configured all the software needed for
metrics tracking. Now, let’s start it.
Tracking metrics
----------------
Before running your Flower simulation, you have to start the monitoring tools you have
just installed and configured.
.. code-block:: bash
brew services start prometheus
brew services start grafana
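To double-check that both services came up, you can query their health endpoints on the
default ports (Prometheus listens on ``9090``, Grafana on ``3000``):

.. code-block:: bash

    # Prometheus health probe
    curl -s http://localhost:9090/-/healthy
    # Grafana health endpoint
    curl -s http://localhost:3000/api/health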
Please include the following argument in your Python code when starting a simulation.
.. code-block:: python
fl.simulation.start_simulation(
# ...
# all the args you used before
# ...
ray_init_args={"include_dashboard": True}
)
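If it helps to see the argument in context, below is a minimal, self-contained sketch of
such a simulation script. The ``DummyClient`` is a hypothetical stand-in for your own
client implementation, and depending on your ``flwr`` version, ``client_fn`` may need to
return ``DummyClient().to_client()`` instead:

.. code-block:: python

    import flwr as fl
    import numpy as np


    class DummyClient(fl.client.NumPyClient):
        # A trivial client holding a single fake parameter array
        def get_parameters(self, config):
            return [np.zeros(3)]

        def fit(self, parameters, config):
            # No actual training; just echo the parameters back
            return parameters, 1, {}

        def evaluate(self, parameters, config):
            # Report a constant dummy loss
            return 0.0, 1, {}


    def client_fn(cid: str):
        return DummyClient()


    fl.simulation.start_simulation(
        client_fn=client_fn,
        num_clients=2,
        config=fl.server.ServerConfig(num_rounds=1),
        ray_init_args={"include_dashboard": True},
    )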
Now, you are ready to start your workload.
Shortly after the simulation starts, you should see the following logs in your terminal:
.. code-block:: bash
2023-01-20 16:22:58,620 INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
You can look at everything at http://127.0.0.1:8265.
It's the Ray Dashboard. You can navigate to Metrics (on the left panel, the lowest
option).
Alternatively, you can see the metrics in Grafana by clicking “View in Grafana” in the
top-right corner. Please note that the Ray dashboard is only accessible during the
simulation. After the simulation ends, you can only use Grafana to explore the metrics.
You can access Grafana at ``http://localhost:3000/``.
After you finish the visualization, stop Prometheus and Grafana. This is important, as
otherwise they will keep ports (for example, port ``3000``) blocked on your machine for
as long as they are running.
.. code-block:: bash
brew services stop prometheus
brew services stop grafana
Resource allocation
-------------------
You must understand how the Ray library works to efficiently allocate system resources
to simulation clients on your own.

Initially, the simulation (which Ray handles under the hood) starts by default with all
the available resources on the system, which it shares among the clients. This does not
mean that it divides the resources equally among all of them, nor that model training
happens on all of them simultaneously. You will learn more about that later in this
section.
You can check the system resources by running the following:
.. code-block:: python

    import ray

    # Ray must be initialized first (start_simulation does this for you otherwise)
    ray.init()
    ray.available_resources()
In Google Colab, the result you see might be similar to this:
.. code-block:: bash
{'memory': 8020104807.0,
'GPU': 1.0,
'object_store_memory': 4010052403.0,
'CPU': 2.0,
'accelerator_type:T4': 1.0,
'node:172.28.0.2': 1.0}
However, you can overwrite the defaults. When starting a simulation, do the following
(you don't need to overwrite all of them):
.. code-block:: python
num_cpus = 2
num_gpus = 1
ram_memory = 16_000 * 1024 * 1024 # 16 GB
fl.simulation.start_simulation(
# ...
# all the args you were specifying before
# ...
ray_init_args={
"include_dashboard": True, # we need this one for tracking
"num_cpus": num_cpus,
"num_gpus": num_gpus,
"memory": ram_memory,
}
)
Let’s also specify the resources for a single client.
.. code-block:: python
# Total resources for simulation
num_cpus = 4
num_gpus = 1
ram_memory = 16_000 * 1024 * 1024 # 16 GB
# Single client resources
client_num_cpus = 2
client_num_gpus = 1
fl.simulation.start_simulation(
# ...
# all the args you were specifying before
# ...
ray_init_args={
"include_dashboard": True, # we need this one for tracking
"num_cpus": num_cpus,
"num_gpus": num_gpus,
"memory": ram_memory,
},
# The argument below is new
client_resources={
"num_cpus": client_num_cpus,
"num_gpus": client_num_gpus,
},
)
Now comes the crucial part. Ray will start a new client only when it has all the
required resources available, so clients run in parallel only when the resources allow
it.

In the example above, only one client will run at a time, so your clients won't run
concurrently. Setting ``client_num_gpus = 0.5`` would allow running two clients and
therefore enable them to run concurrently. Be careful not to require more resources than
are available. If you specified ``client_num_gpus = 2``, the simulation wouldn't start
(even if you had 2 GPUs, since you set only 1 in ``ray_init_args``).
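To reason about concurrency, you can do the arithmetic yourself: the number of clients
that run in parallel is bounded by the scarcest resource. A quick sketch using the
numbers from the example above:

.. code-block:: python

    # Totals handed to Ray in ray_init_args
    num_cpus, num_gpus = 4, 1

    # Per-client requirements from client_resources
    client_num_cpus, client_num_gpus = 2, 0.5

    # Parallelism is limited by whichever resource runs out first
    max_by_cpu = num_cpus // client_num_cpus  # 2
    max_by_gpu = int(num_gpus / client_num_gpus)  # 2
    print(min(max_by_cpu, max_by_gpu))  # 2 clients can run concurrently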
FAQ
---
Q: I don't see any metrics logged.
A: The timeframe might not be properly set. The setting is in the top right corner
("Last 30 minutes" by default). Please change the timeframe to reflect the period when
the simulation was running.
Q: I see “Grafana server not detected. Please make sure the Grafana server is running
and refresh this page” after going to the Metrics tab in Ray Dashboard.
A: You probably don't have Grafana running. Please check the running services:
.. code-block:: bash
brew services list
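If Grafana is missing from the list or not marked as ``started``, launch it again with
``brew services start grafana`` and refresh the Metrics tab.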
Q: I see "This site can't be reached" when going to http://127.0.0.1:8265.
A: Either the simulation has already finished, or you still need to start Prometheus.
Resources
---------
- Ray Dashboard: https://docs.ray.io/en/latest/ray-observability/getting-started.html
- Ray Metrics: https://docs.ray.io/en/latest/cluster/metrics.html