702 lines
33 KiB
ReStructuredText
702 lines
33 KiB
ReStructuredText
|
.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
|
||
|
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`
|
||
|
|
||
|
=======================
|
||
|
CPU Performance Scaling
|
||
|
=======================
|
||
|
|
||
|
::
|
||
|
|
||
|
Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||
|
|
||
|
The Concept of CPU Performance Scaling
|
||
|
======================================
|
||
|
|
||
|
The majority of modern processors are capable of operating in a number of
|
||
|
different clock frequency and voltage configurations, often referred to as
|
||
|
Operating Performance Points or P-states (in ACPI terminology). As a rule,
|
||
|
the higher the clock frequency and the higher the voltage, the more instructions
|
||
|
can be retired by the CPU over a unit of time, but also the higher the clock
|
||
|
frequency and the higher the voltage, the more energy is consumed over a unit of
|
||
|
time (or the more power is drawn) by the CPU in the given P-state. Therefore
|
||
|
there is a natural tradeoff between the CPU capacity (the number of instructions
|
||
|
that can be executed over a unit of time) and the power drawn by the CPU.
|
||
|
|
||
|
In some situations it is desirable or even necessary to run the program as fast
|
||
|
as possible and then there is no reason to use any P-states different from the
|
||
|
highest one (i.e. the highest-performance frequency/voltage configuration
|
||
|
available). In some other cases, however, it may not be necessary to execute
|
||
|
instructions so quickly and maintaining the highest available CPU capacity for a
|
||
|
relatively long time without utilizing it entirely may be regarded as wasteful.
|
||
|
It also may not be physically possible to maintain maximum CPU capacity for too
|
||
|
long for thermal or power supply capacity reasons or similar. To cover those
|
||
|
cases, there are hardware interfaces allowing CPUs to be switched between
|
||
|
different frequency/voltage configurations or (in the ACPI terminology) to be
|
||
|
put into different P-states.
|
||
|
|
||
|
Typically, they are used along with algorithms to estimate the required CPU
|
||
|
capacity, so as to decide which P-states to put the CPUs into. Of course, since
|
||
|
the utilization of the system generally changes over time, that has to be done
|
||
|
repeatedly on a regular basis. The activity by which this happens is referred
|
||
|
to as CPU performance scaling or CPU frequency scaling (because it involves
|
||
|
adjusting the CPU clock frequency).
|
||
|
|
||
|
|
||
|
CPU Performance Scaling in Linux
|
||
|
================================
|
||
|
|
||
|
The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
|
||
|
(CPU Frequency scaling) subsystem that consists of three layers of code: the
|
||
|
core, scaling governors and scaling drivers.
|
||
|
|
||
|
The ``CPUFreq`` core provides the common code infrastructure and user space
|
||
|
interfaces for all platforms that support CPU performance scaling. It defines
|
||
|
the basic framework in which the other components operate.
|
||
|
|
||
|
Scaling governors implement algorithms to estimate the required CPU capacity.
|
||
|
As a rule, each governor implements one, possibly parametrized, scaling
|
||
|
algorithm.
|
||
|
|
||
|
Scaling drivers talk to the hardware. They provide scaling governors with
|
||
|
information on the available P-states (or P-state ranges in some cases) and
|
||
|
access platform-specific hardware interfaces to change CPU P-states as requested
|
||
|
by scaling governors.
|
||
|
|
||
|
In principle, all available scaling governors can be used with every scaling
|
||
|
driver. That design is based on the observation that the information used by
|
||
|
performance scaling algorithms for P-state selection can be represented in a
|
||
|
platform-independent form in the majority of cases, so it should be possible
|
||
|
to use the same performance scaling algorithm implemented in exactly the same
|
||
|
way regardless of which scaling driver is used. Consequently, the same set of
|
||
|
scaling governors should be suitable for every supported platform.
|
||
|
|
||
|
However, that observation may not hold for performance scaling algorithms
|
||
|
based on information provided by the hardware itself, for example through
|
||
|
feedback registers, as that information is typically specific to the hardware
|
||
|
interface it comes from and may not be easily represented in an abstract,
|
||
|
platform-independent way. For this reason, ``CPUFreq`` allows scaling drivers
|
||
|
to bypass the governor layer and implement their own performance scaling
|
||
|
algorithms. That is done by the |intel_pstate| scaling driver.
|
||
|
|
||
|
|
||
|
``CPUFreq`` Policy Objects
|
||
|
==========================
|
||
|
|
||
|
In some cases the hardware interface for P-state control is shared by multiple
|
||
|
CPUs. That is, for example, the same register (or set of registers) is used to
|
||
|
control the P-state of multiple CPUs at the same time and writing to it affects
|
||
|
all of those CPUs simultaneously.
|
||
|
|
||
|
Sets of CPUs sharing hardware P-state control interfaces are represented by
|
||
|
``CPUFreq`` as |struct cpufreq_policy| objects. For consistency,
|
||
|
|struct cpufreq_policy| is also used when there is only one CPU in the given
|
||
|
set.
|
||
|
|
||
|
The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
|
||
|
every CPU in the system, including CPUs that are currently offline. If multiple
|
||
|
CPUs share the same hardware P-state control interface, all of the pointers
|
||
|
corresponding to them point to the same |struct cpufreq_policy| object.
|
||
|
|
||
|
``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
|
||
|
of its user space interface is based on the policy concept.
|
||
|
|
||
|
|
||
|
CPU Initialization
|
||
|
==================
|
||
|
|
||
|
First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
|
||
|
It is only possible to register one scaling driver at a time, so the scaling
|
||
|
driver is expected to be able to handle all CPUs in the system.
|
||
|
|
||
|
The scaling driver may be registered before or after CPU registration. If
|
||
|
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
|
||
|
take a note of all of the already registered CPUs during the registration of the
|
||
|
scaling driver. In turn, if any CPUs are registered after the registration of
|
||
|
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
|
||
|
at their registration time.
|
||
|
|
||
|
In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
|
||
|
has not seen so far as soon as it is ready to handle that CPU. [Note that the
|
||
|
logical CPU may be a physical single-core processor, or a single core in a
|
||
|
multicore processor, or a hardware thread in a physical processor or processor
|
||
|
core. In what follows "CPU" always means "logical CPU" unless explicitly stated
|
||
|
otherwise and the word "processor" is used to refer to the physical part
|
||
|
possibly including multiple logical CPUs.]
|
||
|
|
||
|
Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
|
||
|
for the given CPU and if so, it skips the policy object creation. Otherwise,
|
||
|
a new policy object is created and initialized, which involves the creation of
|
||
|
a new policy directory in ``sysfs``, and the policy pointer corresponding to
|
||
|
the given CPU is set to the new policy object's address in memory.
|
||
|
|
||
|
Next, the scaling driver's ``->init()`` callback is invoked with the policy
|
||
|
pointer of the new CPU passed to it as the argument. That callback is expected
|
||
|
to initialize the performance scaling hardware interface for the given CPU (or,
|
||
|
more precisely, for the set of CPUs sharing the hardware interface it belongs
|
||
|
to, represented by its policy object) and, if the policy object it has been
|
||
|
called for is new, to set parameters of the policy, like the minimum and maximum
|
||
|
frequencies supported by the hardware, the table of available frequencies (if
|
||
|
the set of supported P-states is not a continuous range), and the mask of CPUs
|
||
|
that belong to the same policy (including both online and offline CPUs). That
|
||
|
mask is then used by the core to populate the policy pointers for all of the
|
||
|
CPUs in it.
|
||
|
|
||
|
The next major initialization step for a new policy object is to attach a
|
||
|
scaling governor to it (to begin with, that is the default scaling governor
|
||
|
determined by the kernel configuration, but it may be changed later
|
||
|
via ``sysfs``). First, a pointer to the new policy object is passed to the
|
||
|
governor's ``->init()`` callback which is expected to initialize all of the
|
||
|
data structures necessary to handle the given policy and, possibly, to add
|
||
|
a governor ``sysfs`` interface to it. Next, the governor is started by
|
||
|
invoking its ``->start()`` callback.
|
||
|
|
||
|
That callback it expected to register per-CPU utilization update callbacks for
|
||
|
all of the online CPUs belonging to the given policy with the CPU scheduler.
|
||
|
The utilization update callbacks will be invoked by the CPU scheduler on
|
||
|
important events, like task enqueue and dequeue, on every iteration of the
|
||
|
scheduler tick or generally whenever the CPU utilization may change (from the
|
||
|
scheduler's perspective). They are expected to carry out computations needed
|
||
|
to determine the P-state to use for the given policy going forward and to
|
||
|
invoke the scaling driver to make changes to the hardware in accordance with
|
||
|
the P-state selection. The scaling driver may be invoked directly from
|
||
|
scheduler context or asynchronously, via a kernel thread or workqueue, depending
|
||
|
on the configuration and capabilities of the scaling driver and the governor.
|
||
|
|
||
|
Similar steps are taken for policy objects that are not new, but were "inactive"
|
||
|
previously, meaning that all of the CPUs belonging to them were offline. The
|
||
|
only practical difference in that case is that the ``CPUFreq`` core will attempt
|
||
|
to use the scaling governor previously used with the policy that became
|
||
|
"inactive" (and is re-initialized now) instead of the default governor.
|
||
|
|
||
|
In turn, if a previously offline CPU is being brought back online, but some
|
||
|
other CPUs sharing the policy object with it are online already, there is no
|
||
|
need to re-initialize the policy object at all. In that case, it only is
|
||
|
necessary to restart the scaling governor so that it can take the new online CPU
|
||
|
into account. That is achieved by invoking the governor's ``->stop`` and
|
||
|
``->start()`` callbacks, in this order, for the entire policy.
|
||
|
|
||
|
As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
|
||
|
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
|
||
|
Consequently, if |intel_pstate| is used, scaling governors are not attached to
|
||
|
new policy objects. Instead, the driver's ``->setpolicy()`` callback is invoked
|
||
|
to register per-CPU utilization update callbacks for each policy. These
|
||
|
callbacks are invoked by the CPU scheduler in the same way as for scaling
|
||
|
governors, but in the |intel_pstate| case they both determine the P-state to
|
||
|
use and change the hardware configuration accordingly in one go from scheduler
|
||
|
context.
|
||
|
|
||
|
The policy objects created during CPU initialization and other data structures
|
||
|
associated with them are torn down when the scaling driver is unregistered
|
||
|
(which happens when the kernel module containing it is unloaded, for example) or
|
||
|
when the last CPU belonging to the given policy in unregistered.
|
||
|
|
||
|
|
||
|
Policy Interface in ``sysfs``
|
||
|
=============================
|
||
|
|
||
|
During the initialization of the kernel, the ``CPUFreq`` core creates a
|
||
|
``sysfs`` directory (kobject) called ``cpufreq`` under
|
||
|
:file:`/sys/devices/system/cpu/`.
|
||
|
|
||
|
That directory contains a ``policyX`` subdirectory (where ``X`` represents an
|
||
|
integer number) for every policy object maintained by the ``CPUFreq`` core.
|
||
|
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
|
||
|
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
|
||
|
that may be different from the one represented by ``X``) for all of the CPUs
|
||
|
associated with (or belonging to) the given policy. The ``policyX`` directories
|
||
|
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
|
||
|
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
|
||
|
objects (that is, for all of the CPUs associated with them).
|
||
|
|
||
|
Some of those attributes are generic. They are created by the ``CPUFreq`` core
|
||
|
and their behavior generally does not depend on what scaling driver is in use
|
||
|
and what scaling governor is attached to the given policy. Some scaling drivers
|
||
|
also add driver-specific attributes to the policy directories in ``sysfs`` to
|
||
|
control policy-specific aspects of driver behavior.
|
||
|
|
||
|
The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
|
||
|
are the following:
|
||
|
|
||
|
``affected_cpus``
|
||
|
List of online CPUs belonging to this policy (i.e. sharing the hardware
|
||
|
performance scaling interface represented by the ``policyX`` policy
|
||
|
object).
|
||
|
|
||
|
``bios_limit``
|
||
|
If the platform firmware (BIOS) tells the OS to apply an upper limit to
|
||
|
CPU frequencies, that limit will be reported through this attribute (if
|
||
|
present).
|
||
|
|
||
|
The existence of the limit may be a result of some (often unintentional)
|
||
|
BIOS settings, restrictions coming from a service processor or another
|
||
|
BIOS/HW-based mechanisms.
|
||
|
|
||
|
This does not cover ACPI thermal limitations which can be discovered
|
||
|
through a generic thermal driver.
|
||
|
|
||
|
This attribute is not present if the scaling driver in use does not
|
||
|
support it.
|
||
|
|
||
|
``cpuinfo_cur_freq``
|
||
|
Current frequency of the CPUs belonging to this policy as obtained from
|
||
|
the hardware (in KHz).
|
||
|
|
||
|
This is expected to be the frequency the hardware actually runs at.
|
||
|
If that frequency cannot be determined, this attribute should not
|
||
|
be present.
|
||
|
|
||
|
``cpuinfo_max_freq``
|
||
|
Maximum possible operating frequency the CPUs belonging to this policy
|
||
|
can run at (in kHz).
|
||
|
|
||
|
``cpuinfo_min_freq``
|
||
|
Minimum possible operating frequency the CPUs belonging to this policy
|
||
|
can run at (in kHz).
|
||
|
|
||
|
``cpuinfo_transition_latency``
|
||
|
The time it takes to switch the CPUs belonging to this policy from one
|
||
|
P-state to another, in nanoseconds.
|
||
|
|
||
|
If unknown or if known to be so high that the scaling driver does not
|
||
|
work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
|
||
|
will be returned by reads from this attribute.
|
||
|
|
||
|
``related_cpus``
|
||
|
List of all (online and offline) CPUs belonging to this policy.
|
||
|
|
||
|
``scaling_available_governors``
|
||
|
List of ``CPUFreq`` scaling governors present in the kernel that can
|
||
|
be attached to this policy or (if the |intel_pstate| scaling driver is
|
||
|
in use) list of scaling algorithms provided by the driver that can be
|
||
|
applied to this policy.
|
||
|
|
||
|
[Note that some governors are modular and it may be necessary to load a
|
||
|
kernel module for the governor held by it to become available and be
|
||
|
listed by this attribute.]
|
||
|
|
||
|
``scaling_cur_freq``
|
||
|
Current frequency of all of the CPUs belonging to this policy (in kHz).
|
||
|
|
||
|
In the majority of cases, this is the frequency of the last P-state
|
||
|
requested by the scaling driver from the hardware using the scaling
|
||
|
interface provided by it, which may or may not reflect the frequency
|
||
|
the CPU is actually running at (due to hardware design and other
|
||
|
limitations).
|
||
|
|
||
|
Some architectures (e.g. ``x86``) may attempt to provide information
|
||
|
more precisely reflecting the current CPU frequency through this
|
||
|
attribute, but that still may not be the exact current CPU frequency as
|
||
|
seen by the hardware at the moment.
|
||
|
|
||
|
``scaling_driver``
|
||
|
The scaling driver currently in use.
|
||
|
|
||
|
``scaling_governor``
|
||
|
The scaling governor currently attached to this policy or (if the
|
||
|
|intel_pstate| scaling driver is in use) the scaling algorithm
|
||
|
provided by the driver that is currently applied to this policy.
|
||
|
|
||
|
This attribute is read-write and writing to it will cause a new scaling
|
||
|
governor to be attached to this policy or a new scaling algorithm
|
||
|
provided by the scaling driver to be applied to it (in the
|
||
|
|intel_pstate| case), as indicated by the string written to this
|
||
|
attribute (which must be one of the names listed by the
|
||
|
``scaling_available_governors`` attribute described above).
|
||
|
|
||
|
``scaling_max_freq``
|
||
|
Maximum frequency the CPUs belonging to this policy are allowed to be
|
||
|
running at (in kHz).
|
||
|
|
||
|
This attribute is read-write and writing a string representing an
|
||
|
integer to it will cause a new limit to be set (it must not be lower
|
||
|
than the value of the ``scaling_min_freq`` attribute).
|
||
|
|
||
|
``scaling_min_freq``
|
||
|
Minimum frequency the CPUs belonging to this policy are allowed to be
|
||
|
running at (in kHz).
|
||
|
|
||
|
This attribute is read-write and writing a string representing a
|
||
|
non-negative integer to it will cause a new limit to be set (it must not
|
||
|
be higher than the value of the ``scaling_max_freq`` attribute).
|
||
|
|
||
|
``scaling_setspeed``
|
||
|
This attribute is functional only if the `userspace`_ scaling governor
|
||
|
is attached to the given policy.
|
||
|
|
||
|
It returns the last frequency requested by the governor (in kHz) or can
|
||
|
be written to in order to set a new frequency for the policy.
|
||
|
|
||
|
|
||
|
Generic Scaling Governors
|
||
|
=========================
|
||
|
|
||
|
``CPUFreq`` provides generic scaling governors that can be used with all
|
||
|
scaling drivers. As stated before, each of them implements a single, possibly
|
||
|
parametrized, performance scaling algorithm.
|
||
|
|
||
|
Scaling governors are attached to policy objects and different policy objects
|
||
|
can be handled by different scaling governors at the same time (although that
|
||
|
may lead to suboptimal results in some cases).
|
||
|
|
||
|
The scaling governor for a given policy object can be changed at any time with
|
||
|
the help of the ``scaling_governor`` policy attribute in ``sysfs``.
|
||
|
|
||
|
Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
|
||
|
algorithms implemented by them. Those attributes, referred to as governor
|
||
|
tunables, can be either global (system-wide) or per-policy, depending on the
|
||
|
scaling driver in use. If the driver requires governor tunables to be
|
||
|
per-policy, they are located in a subdirectory of each policy directory.
|
||
|
Otherwise, they are located in a subdirectory under
|
||
|
:file:`/sys/devices/system/cpu/cpufreq/`. In either case the name of the
|
||
|
subdirectory containing the governor tunables is the name of the governor
|
||
|
providing them.
|
||
|
|
||
|
``performance``
|
||
|
---------------
|
||
|
|
||
|
When attached to a policy object, this governor causes the highest frequency,
|
||
|
within the ``scaling_max_freq`` policy limit, to be requested for that policy.
|
||
|
|
||
|
The request is made once at that time the governor for the policy is set to
|
||
|
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
|
||
|
policy limits change after that.
|
||
|
|
||
|
``powersave``
|
||
|
-------------
|
||
|
|
||
|
When attached to a policy object, this governor causes the lowest frequency,
|
||
|
within the ``scaling_min_freq`` policy limit, to be requested for that policy.
|
||
|
|
||
|
The request is made once at that time the governor for the policy is set to
|
||
|
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
|
||
|
policy limits change after that.
|
||
|
|
||
|
``userspace``
|
||
|
-------------
|
||
|
|
||
|
This governor does not do anything by itself. Instead, it allows user space
|
||
|
to set the CPU frequency for the policy it is attached to by writing to the
|
||
|
``scaling_setspeed`` attribute of that policy.
|
||
|
|
||
|
``schedutil``
|
||
|
-------------
|
||
|
|
||
|
This governor uses CPU utilization data available from the CPU scheduler. It
|
||
|
generally is regarded as a part of the CPU scheduler, so it can access the
|
||
|
scheduler's internal data structures directly.
|
||
|
|
||
|
It runs entirely in scheduler context, although in some cases it may need to
|
||
|
invoke the scaling driver asynchronously when it decides that the CPU frequency
|
||
|
should be changed for a given policy (that depends on whether or not the driver
|
||
|
is capable of changing the CPU frequency from scheduler context).
|
||
|
|
||
|
The actions of this governor for a particular CPU depend on the scheduling class
|
||
|
invoking its utilization update callback for that CPU. If it is invoked by the
|
||
|
RT or deadline scheduling classes, the governor will increase the frequency to
|
||
|
the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn,
|
||
|
if it is invoked by the CFS scheduling class, the governor will use the
|
||
|
Per-Entity Load Tracking (PELT) metric for the root control group of the
|
||
|
given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_
|
||
|
LWN.net article for a description of the PELT mechanism). Then, the new
|
||
|
CPU frequency to apply is computed in accordance with the formula
|
||
|
|
||
|
f = 1.25 * ``f_0`` * ``util`` / ``max``
|
||
|
|
||
|
where ``util`` is the PELT number, ``max`` is the theoretical maximum of
|
||
|
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
|
||
|
policy (if the PELT number is frequency-invariant), or the current CPU frequency
|
||
|
(otherwise).
|
||
|
|
||
|
This governor also employs a mechanism allowing it to temporarily bump up the
|
||
|
CPU frequency for tasks that have been waiting on I/O most recently, called
|
||
|
"IO-wait boosting". That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
|
||
|
is passed by the scheduler to the governor callback which causes the frequency
|
||
|
to go up to the allowed maximum immediately and then draw back to the value
|
||
|
returned by the above formula over time.
|
||
|
|
||
|
This governor exposes only one tunable:
|
||
|
|
||
|
``rate_limit_us``
|
||
|
Minimum time (in microseconds) that has to pass between two consecutive
|
||
|
runs of governor computations (default: 1000 times the scaling driver's
|
||
|
transition latency).
|
||
|
|
||
|
The purpose of this tunable is to reduce the scheduler context overhead
|
||
|
of the governor which might be excessive without it.
|
||
|
|
||
|
This governor generally is regarded as a replacement for the older `ondemand`_
|
||
|
and `conservative`_ governors (described below), as it is simpler and more
|
||
|
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
|
||
|
switches and similar is less significant, and it uses the scheduler's own CPU
|
||
|
utilization metric, so in principle its decisions should not contradict the
|
||
|
decisions made by the other parts of the scheduler.
|
||
|
|
||
|
``ondemand``
|
||
|
------------
|
||
|
|
||
|
This governor uses CPU load as a CPU frequency selection metric.
|
||
|
|
||
|
In order to estimate the current CPU load, it measures the time elapsed between
|
||
|
consecutive invocations of its worker routine and computes the fraction of that
|
||
|
time in which the given CPU was not idle. The ratio of the non-idle (active)
|
||
|
time to the total CPU time is taken as an estimate of the load.
|
||
|
|
||
|
If this governor is attached to a policy shared by multiple CPUs, the load is
|
||
|
estimated for all of them and the greatest result is taken as the load estimate
|
||
|
for the entire policy.
|
||
|
|
||
|
The worker routine of this governor has to run in process context, so it is
|
||
|
invoked asynchronously (via a workqueue) and CPU P-states are updated from
|
||
|
there if necessary. As a result, the scheduler context overhead from this
|
||
|
governor is minimum, but it causes additional CPU context switches to happen
|
||
|
relatively often and the CPU P-state updates triggered by it can be relatively
|
||
|
irregular. Also, it affects its own CPU load metric by running code that
|
||
|
reduces the CPU idle time (even though the CPU idle time is only reduced very
|
||
|
slightly by it).
|
||
|
|
||
|
It generally selects CPU frequencies proportional to the estimated load, so that
|
||
|
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
|
||
|
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
|
||
|
corresponds to the load of 0, unless when the load exceeds a (configurable)
|
||
|
speedup threshold, in which case it will go straight for the highest frequency
|
||
|
it is allowed to use (the ``scaling_max_freq`` policy limit).
|
||
|
|
||
|
This governor exposes the following tunables:
|
||
|
|
||
|
``sampling_rate``
|
||
|
This is how often the governor's worker routine should run, in
|
||
|
microseconds.
|
||
|
|
||
|
Typically, it is set to values of the order of 10000 (10 ms). Its
|
||
|
default value is equal to the value of ``cpuinfo_transition_latency``
|
||
|
for each policy this governor is attached to (but since the unit here
|
||
|
is greater by 1000, this means that the time represented by
|
||
|
``sampling_rate`` is 1000 times greater than the transition latency by
|
||
|
default).
|
||
|
|
||
|
If this tunable is per-policy, the following shell command sets the time
|
||
|
represented by it to be 750 times as high as the transition latency::
|
||
|
|
||
|
# echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
|
||
|
|
||
|
``up_threshold``
|
||
|
If the estimated CPU load is above this value (in percent), the governor
|
||
|
will set the frequency to the maximum value allowed for the policy.
|
||
|
Otherwise, the selected frequency will be proportional to the estimated
|
||
|
CPU load.
|
||
|
|
||
|
``ignore_nice_load``
|
||
|
If set to 1 (default 0), it will cause the CPU load estimation code to
|
||
|
treat the CPU time spent on executing tasks with "nice" levels greater
|
||
|
than 0 as CPU idle time.
|
||
|
|
||
|
This may be useful if there are tasks in the system that should not be
|
||
|
taken into account when deciding what frequency to run the CPUs at.
|
||
|
Then, to make that happen it is sufficient to increase the "nice" level
|
||
|
of those tasks above 0 and set this attribute to 1.
|
||
|
|
||
|
``sampling_down_factor``
|
||
|
Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
|
||
|
the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.
|
||
|
|
||
|
This causes the next execution of the governor's worker routine (after
|
||
|
setting the frequency to the allowed maximum) to be delayed, so the
|
||
|
frequency stays at the maximum level for a longer time.
|
||
|
|
||
|
Frequency fluctuations in some bursty workloads may be avoided this way
|
||
|
at the cost of additional energy spent on maintaining the maximum CPU
|
||
|
capacity.
|
||
|
|
||
|
``powersave_bias``
|
||
|
Reduction factor to apply to the original frequency target of the
|
||
|
governor (including the maximum value used when the ``up_threshold``
|
||
|
value is exceeded by the estimated CPU load) or sensitivity threshold
|
||
|
for the AMD frequency sensitivity powersave bias driver
|
||
|
(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
|
||
|
inclusive.
|
||
|
|
||
|
If the AMD frequency sensitivity powersave bias driver is not loaded,
|
||
|
the effective frequency to apply is given by
|
||
|
|
||
|
f * (1 - ``powersave_bias`` / 1000)
|
||
|
|
||
|
where f is the governor's original frequency target. The default value
|
||
|
of this attribute is 0 in that case.
|
||
|
|
||
|
If the AMD frequency sensitivity powersave bias driver is loaded, the
|
||
|
value of this attribute is 400 by default and it is used in a different
|
||
|
way.
|
||
|
|
||
|
On Family 16h (and later) AMD processors there is a mechanism to get a
|
||
|
measured workload sensitivity, between 0 and 100% inclusive, from the
|
||
|
hardware. That value can be used to estimate how the performance of the
|
||
|
workload running on a CPU will change in response to frequency changes.
|
||
|
|
||
|
The performance of a workload with the sensitivity of 0 (memory-bound or
|
||
|
IO-bound) is not expected to increase at all as a result of increasing
|
||
|
the CPU frequency, whereas workloads with the sensitivity of 100%
|
||
|
(CPU-bound) are expected to perform much better if the CPU frequency is
|
||
|
increased.
|
||
|
|
||
|
If the workload sensitivity is less than the threshold represented by
|
||
|
the ``powersave_bias`` value, the sensitivity powersave bias driver
|
||
|
will cause the governor to select a frequency lower than its original
|
||
|
target, so as to avoid over-provisioning workloads that will not benefit
|
||
|
from running at higher CPU frequencies.
|
||
|
|
||
|
``conservative``
|
||
|
----------------
|
||
|
|
||
|
This governor uses CPU load as a CPU frequency selection metric.
|
||
|
|
||
|
It estimates the CPU load in the same way as the `ondemand`_ governor described
|
||
|
above, but the CPU frequency selection algorithm implemented by it is different.
|
||
|
|
||
|
Namely, it avoids changing the frequency significantly over short time intervals
|
||
|
which may not be suitable for systems with limited power supply capacity (e.g.
|
||
|
battery-powered). To achieve that, it changes the frequency in relatively
|
||
|
small steps, one step at a time, up or down - depending on whether or not a
|
||
|
(configurable) threshold has been exceeded by the estimated CPU load.
|
||
|
|
||
|
This governor exposes the following tunables:
|
||
|
|
||
|
``freq_step``
|
||
|
Frequency step in percent of the maximum frequency the governor is
|
||
|
allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
|
||
|
100 (5 by default).
|
||
|
|
||
|
This is how much the frequency is allowed to change in one go. Setting
|
||
|
it to 0 will cause the default frequency step (5 percent) to be used
|
||
|
and setting it to 100 effectively causes the governor to periodically
|
||
|
switch the frequency between the ``scaling_min_freq`` and
|
||
|
``scaling_max_freq`` policy limits.
|
||
|
|
||
|
``down_threshold``
|
||
|
Threshold value (in percent, 20 by default) used to determine the
|
||
|
frequency change direction.
|
||
|
|
||
|
If the estimated CPU load is greater than this value, the frequency will
|
||
|
go up (by ``freq_step``). If the load is less than this value (and the
|
||
|
``sampling_down_factor`` mechanism is not in effect), the frequency will
|
||
|
go down. Otherwise, the frequency will not be changed.
|
||
|
|
||
|
``sampling_down_factor``
|
||
|
Frequency decrease deferral factor, between 1 (default) and 10
|
||
|
inclusive.
|
||
|
|
||
|
It effectively causes the frequency to go down ``sampling_down_factor``
|
||
|
times slower than it ramps up.
|
||
|
|
||
|
|
||
|
Frequency Boost Support
|
||
|
=======================
|
||
|
|
||
|
Background
|
||
|
----------
|
||
|
|
||
|
Some processors support a mechanism to raise the operating frequency of some
|
||
|
cores in a multicore package temporarily (and above the sustainable frequency
|
||
|
threshold for the whole package) under certain conditions, for example if the
|
||
|
whole chip is not fully utilized and below its intended thermal or power budget.
|
||
|
|
||
|
Different names are used by different vendors to refer to this functionality.
|
||
|
For Intel processors it is referred to as "Turbo Boost", AMD calls it
|
||
|
"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
|
||
|
As a rule, it also is implemented differently by different vendors. The simple
|
||
|
term "frequency boost" is used here for brevity to refer to all of those
|
||
|
implementations.
|
||
|
|
||
|
The frequency boost mechanism may be either hardware-based or software-based.
|
||
|
If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
|
||
|
made by the hardware (although in general it requires the hardware to be put
|
||
|
into a special state in which it can control the CPU frequency within certain
|
||
|
limits). If it is software-based (e.g. on ARM), the scaling driver decides
|
||
|
whether or not to trigger boosting and when to do that.
|
||
|
|
||
|
The ``boost`` File in ``sysfs``
|
||
|
-------------------------------
|
||
|
|
||
|
This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
|
||
|
the "boost" setting for the whole system. It is not present if the underlying
|
||
|
scaling driver does not support the frequency boost mechanism (or supports it,
|
||
|
but provides a driver-specific interface for controlling it, like
|
||
|
|intel_pstate|).
|
||
|
|
||
|
If the value in this file is 1, the frequency boost mechanism is enabled. This
|
||
|
means that either the hardware can be put into states in which it is able to
|
||
|
trigger boosting (in the hardware-based case), or the software is allowed to
|
||
|
trigger boosting (in the software-based case). It does not mean that boosting
|
||
|
is actually in use at the moment on any CPUs in the system. It only means a
|
||
|
permission to use the frequency boost mechanism (which still may never be used
|
||
|
for other reasons).
|
||
|
|
||
|
If the value in this file is 0, the frequency boost mechanism is disabled and
|
||
|
cannot be used at all.
|
||
|
|
||
|
The only values that can be written to this file are 0 and 1.
|
||
|
|
||
|
Rationale for Boost Control Knob
|
||
|
--------------------------------
|
||
|
|
||
|
The frequency boost mechanism is generally intended to help to achieve optimum
|
||
|
CPU performance on time scales below software resolution (e.g. below the
|
||
|
scheduler tick interval) and it is demonstrably suitable for many workloads, but
|
||
|
it may lead to problems in certain situations.
|
||
|
|
||
|
For this reason, many systems make it possible to disable the frequency boost
|
||
|
mechanism in the platform firmware (BIOS) setup, but that requires the system to
|
||
|
be restarted for the setting to be adjusted as desired, which may not be
|
||
|
practical at least in some cases. For example:
|
||
|
|
||
|
1. Boosting means overclocking the processor, although under controlled
|
||
|
conditions. Generally, the processor's energy consumption increases
|
||
|
as a result of increasing its frequency and voltage, even temporarily.
|
||
|
That may not be desirable on systems that switch to power sources of
|
||
|
limited capacity, such as batteries, so the ability to disable the boost
|
||
|
mechanism while the system is running may help there (but that depends on
|
||
|
the workload too).
|
||
|
|
||
|
2. In some situations deterministic behavior is more important than
|
||
|
performance or energy consumption (or both) and the ability to disable
|
||
|
boosting while the system is running may be useful then.
|
||
|
|
||
|
3. To examine the impact of the frequency boost mechanism itself, it is useful
|
||
|
to be able to run tests with and without boosting, preferably without
|
||
|
restarting the system in the meantime.
|
||
|
|
||
|
4. Reproducible results are important when running benchmarks. Since
|
||
|
the boosting functionality depends on the load of the whole package,
|
||
|
single-thread performance may vary because of it which may lead to
|
||
|
unreproducible results sometimes. That can be avoided by disabling the
|
||
|
frequency boost mechanism before running benchmarks sensitive to that
|
||
|
issue.
|
||
|
|
||
|
Legacy AMD ``cpb`` Knob
|
||
|
-----------------------
|
||
|
|
||
|
The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
|
||
|
the global ``boost`` one. It is used for disabling/enabling the "Core
|
||
|
Performance Boost" feature of some AMD processors.
|
||
|
|
||
|
If present, that knob is located in every ``CPUFreq`` policy directory in
|
||
|
``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
|
||
|
``cpb``, which indicates a more fine grained control interface. The actual
|
||
|
implementation, however, works on the system-wide basis and setting that knob
|
||
|
for one policy causes the same value of it to be set for all of the other
|
||
|
policies at the same time.
|
||
|
|
||
|
That knob is still supported on AMD processors that support its underlying
|
||
|
hardware feature, but it may be configured out of the kernel (via the
|
||
|
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
|
||
|
``boost`` knob is present regardless. Thus it is always possible use the
|
||
|
``boost`` knob instead of the ``cpb`` one which is highly recommended, as that
|
||
|
is more consistent with what all of the other systems do (and the ``cpb`` knob
|
||
|
may not be supported any more in the future).
|
||
|
|
||
|
The ``cpb`` knob is never present for any processors without the underlying
|
||
|
hardware feature (e.g. all Intel ones), even if the
|
||
|
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.
|
||
|
|
||
|
|
||
|
.. _Per-entity load tracking: https://lwn.net/Articles/531853/
|