I. Introduction
With the emergence and proliferation of services and applications such as cloud computing and social networks, data centers (DCs) play an increasingly important role. It is predicted that global DC IP traffic will grow 3-fold from 2014 to 2019, a compound annual growth rate (CAGR) of 25 percent [1]. At the same time, the high energy consumption of DCs is drawing more and more attention due to economic, social, and environmental concerns. DC electricity consumption in the U.S. alone is projected to reach roughly 140 billion kilowatt-hours annually by 2020, costing $13 billion in electricity bills and emitting nearly 100 million tons of carbon pollution per year [2]. In this paper, we focus on one of the major sources of DC energy consumption (about 38% [3]): cooling energy.
Cooling energy optimization involves the control of a sophisticated cooling system consisting of multiple components, such as the cooling tower, chiller, and ventilation system. A common practice in DC cooling system control is to adjust the setpoints, i.e., the target values of different control variables. For example, by setting the temperature control variable at the outlet of an air conditioner to a desired value, the air conditioner adjusts its internal state to meet the setpoint by consuming a certain amount of energy. The optimal selection of these setpoints can be challenging, as the process relies on knowledge of the cooling system, from thermal dynamics to mechanics. Many existing approaches are based on an approximate system model that incorporates the first-order effects of thermal, electrical, and mechanical principles [4, 5, 6, 7, 8]. These approximate models are sometimes inadequate or too inaccurate to capture the intricacies of the various interacting processes in DC cooling operations, leading to suboptimal or unstable cooling control. Recently, the learning-based approach has emerged as an attractive alternative. A learning-based approach does not assume any specific model of the underlying system; instead, the control policies are learned and derived from massive data collected on system status and energy consumption [9]. This approach is especially advantageous when the complexity of the underlying system makes accurate mathematical modeling a daunting task.
The prevailing cooling control optimization approach, which includes a model-building stage and a solving stage, can be referred to as the two-stage (TS) approach. In comparison, an end-to-end or one-stage approach directly uses the unprocessed, often high-dimensional input to learn a control policy, which then determines the control setting given the system input. One such framework of interest is reinforcement learning (RL) [10], in which neural-network-based control agents were introduced decades ago [11]. Recently, algorithms that combine RL with deep learning (DL), such as the deep Q-network (DQN), have been successfully applied to training AI agents that play video games at human performance level from raw pixel inputs alone, demonstrating the potential of the end-to-end approach [12]. The continuous-domain extension of DQN, called deep deterministic policy gradient (DDPG), has also shown promising results on simulated physical control tasks [13]. However, DDPG has not been widely studied in the context of more realistic and complex control optimization, such as cooling system optimization for DCs (the subject of this paper). It remains to be demonstrated whether an end-to-end approach can achieve control performance similar to or better than that of TS approaches. Moreover, DDPG is a simulation-based algorithm: like many RL algorithms, it requires an excessive amount of computation and possibly very long training time, so it remains to be seen how these challenges can be met. Note that Google claims to use an AI method [14] to reduce the PUE of its DCs, yet no detailed methodology or performance evaluation results have been disclosed.
In our previous work, we reviewed DC energy cost models [15] and existing cooling optimization approaches [16], and conducted several DC power analysis and control studies [17, 18, 19]. Based on these studies, we believe that a data-driven, learning-based optimization method is needed for DC energy optimization, one that can achieve optimization effects with minimal human intervention and reduce the difficulty of DC management.
In this paper, we propose an end-to-end approach for DC cooling control optimization and evaluate the algorithm from various aspects. We develop a cooling control algorithm adapted from DDPG and the actor-critic architecture [13, 20]. Our proposed algorithm is off-policy and offline, as it can be trained with a pre-collected trace to learn and improve the control policy. Besides the standard version of the algorithm, which makes control decisions based on the current state, we also test a recurrent version, which can perform better when the data are noisy. We further evaluate implementation details such as the neural network architecture and hyperparameters to examine the approach thoroughly.
To test the proposed algorithm, we use the EnergyPlus [21] platform to build a test case; besides this simulation case, we also collect a real data trace from the National Super Computing Centre (NSCC) of Singapore and test our algorithm on it. For the simulation test case, we control five setpoints to achieve minimum PUE while maintaining the temperature of the DC zones within a predefined range. The results of the proposed algorithms are compared with those of a standard two-stage control algorithm and the default setpoint-based control algorithm (embedded in the simulation software). The results indicate that the proposed algorithm not only maintains the temperature of the DC zones within the predefined range under varying workload and weather conditions, but also achieves lower PUE, saving about 11% of the cooling cost compared with the baseline algorithm. For the real-data test case from NSCC, we focus on optimizing the airflow rate settings of the three precision cooling units (PCUs) that cool 26 racks. Our results show that the proposed algorithm can approximate the actual temperature with high accuracy (error below 0.1 degrees) and can output control settings that meet the cooling requirements with around 15% energy saving. The main contributions of the paper are summarized as follows.
First, we propose an end-to-end, DRL-based framework for DC cooling control optimization. Our algorithm trains the neural networks with a pre-collected data trace, thereby overcoming the high simulation time cost of simulation-based algorithms such as DDPG. This approach is well suited for a practical DC equipped with a monitoring/sensing system that collects data in real time.
Second, we build a testbed with EnergyPlus and verify the approach on this sophisticated simulation software. Our simulation results indicate that the proposed control algorithm accomplishes the cooling control tasks with about 11% cooling cost saving compared with the baseline approach.
Third, we propose a de-underestimation (DUE) solution for trace-based studies in practice. The DUE method eliminates underestimation of the predicted temperature and thus leads to a more conservative, low-risk energy-saving estimate when a real test or a decent simulation is not available. Such a method is useful, as risk management is essential to DC operation.
In summary, we demonstrate the feasibility and effectiveness of applying an end-to-end neural control algorithm to DC cooling optimization. The evaluation of the proposed algorithm serves as a first step toward an intelligent DC management system that requires minimal manual intervention. Although this work is simulation- and trace-based, it sheds new light on the application of deep reinforcement learning (DRL) to practical DC control optimization, and to other traditional industry areas.
II. Related Works
II-A Recent Progress on DC Cooling Optimization
The cooling system control optimization problem has been examined from different angles. Much of the literature [4, 5, 6, 7, 8, 9, 22] uses a two-stage optimization procedure to optimize cooling efficiency. In the first stage, a model is built to evaluate the efficiency, e.g., the thermal dynamics models in [4, 5, 6, 7, 8, 22]. Recently, the authors of [23] proposed utilizing a computational fluid dynamics (CFD) model to analyze airflow efficiency. These models can be very complex; for example, in [4] the authors proposed a mathematical model of a specific cooling system comprising 43 equations. With such complexity, this kind of model is hard to extrapolate to another DC with a different configuration. Researchers have also tried data-driven approaches, such as the neural network model in [9]. In the second stage, the control variables are optimized via either an analytic optimization algorithm, as in [4], or a random global optimizer, as in [9]. In contrast to existing TS approaches, we propose to train a policy network in an offline manner so that we can avoid the optimization procedure during decision making.
Another line of research on cooling control optimization is to optimize the ice-storage system [24, 25] so that it can be used for cooling when the electricity price is high. In addition to directly optimizing the cooling system as surveyed above, an extensive body of work focuses on the IT side of the DC to save cooling energy. For example, in [26, 27, 28], the workload dispatch problem was studied to optimize the thermal map of the DC and improve cooling efficiency. In [29], the authors proposed a novel network flow management method that uses fewer switches to save power. Our approach can be combined with such IT-side optimization to further reduce the energy cost of a DC.
We note that most existing studies are based on idealized models of simplified situations. In this paper, we propose a new DRL-based solution and verify the approach on both a sophisticated simulation system and a real data trace.
II-B Recent Progress on DRL
Reinforcement learning [11, 30] deals with agents that learn to take better actions directly from the experience of interacting with the environment. Recently, the development and application of RL technologies have flourished. For example, in [31], Liu et al. proposed an adaptive dynamic programming method to perform policy iteration for nonlinear systems; in [32], Luo et al. proposed a neural actor-critic RL solution to the control problem; in [33], Liu et al. proposed a single-critic-network RL solution for constrained-input stabilizing controllers; in [34], Modares et al. applied RL to a human-robot interaction system that minimizes human effort while optimizing the control results; in [35], Pan et al. proposed a neural control algorithm that mimics human motor learning; in [36], Song et al. proposed an off-policy RL method to solve nonlinear non-zero-sum games; and in [37], Deng et al. built a financial trading agent based on deep neural networks.
Amid this flourishing of RL studies and applications, deep reinforcement learning (DRL) [12] has shown its strength in various fields. The deep Q-network (DQN) proposed in [12] applies a neural network approximation to the Q table in Q-learning [38]. Subsequent studies on DQN have focused on improving the training stability of the framework, as in [39], and on extending the framework to problems with continuous control variables [13]. Various applications of DRL have been proposed, such as video processing [40] and text-based games [41]. Yet DRL has not been verified on practical control systems such as the cooling system of a DC, where high simulation cost can be troublesome and the robustness requirements are high. In this paper, we adopt the DDPG algorithm for the cooling system control optimization problem and examine various implementation details related to the robustness of the algorithm.
III. The Cooling Optimization Problem Formulation
To study the cooling control optimization problem, in this section we utilize a simulation model to present a cooling control optimization problem formulation. The simulation model is based on the widely adopted building energy simulation platform EnergyPlus [21]. Although the model is largely simplified, it does capture the major cooling dynamics and is thus adequate for studying the cooling control optimization.
III-A Simulation System Model
The model is based on a simulation example provided by EnergyPlus. As illustrated in Fig. 1, the model consists of two server zones and their associated cooling systems. The two zones differ in size, location, and their corresponding cooling systems. In the following, we describe the cooling system with a focus on identifying the state space (parameters that characterize the system), the action space (control variables), and the reward (optimization objectives), while omitting the details of facility structures and operation processes that are unrelated to the problem setup (yet might be critical parts of the overall system).
III-A1 Data Center Model – State Space and Reward
The DC has two server zones placed side by side, each being a standalone server room. The two zones differ in size (15.24 m × 15.24 m and 15.24 m × 17.00 m, respectively) but are similar in other construction aspects. The heat in each zone is generated by IT equipment (ITE) and other sources (such as illumination), with ITE as the dominant heat source. The load of the ITE is defined as the product of a designed load density (per square meter) and a load factor that varies across time slots. In our simulation, we use a public trace collected from Wikimedia [42] to set the load factor, identical for both zones, while using different load densities of 4 kW and 2 kW for the two zones, respectively. The heat generated by illumination is assumed constant (per square meter), as the lights inside the DC are always on. There is also heat generated by the human workers in the DC, which varies according to the work schedule. A notable simplification of the simulation is that each data zone is modeled as a single point heat source. This is less accurate than CFD-based thermal analysis; we leave a finer-grained model for future work. We note that even with a finer-grained model, the proposed framework and algorithm remain the same, albeit with much larger state and action spaces.
In the context of the RL framework, we use the tuple of workload level and ambient temperature to represent the state, since both affect the cooling load. We use the tuple of PUE and the ITE outlet temperature of each zone to compute the reward. In the context of DC cooling, the PUE needs to be minimized and the outlet temperature needs to be kept within a specific range.
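As a concrete illustration, the state and reward signals above can be encoded as follows (a minimal Python sketch; the function names and the lower temperature bound are our own assumptions, not from the paper):

```python
import numpy as np

# Illustrative encoding of the RL state and the temperature-range check.
def make_state(load_factor, ambient_temp_c):
    """State: workload level and ambient temperature, as a feature vector."""
    return np.array([load_factor, ambient_temp_c])

def in_range(outlet_temps_c, low=18.0, high=29.0):
    """Check that each zone's ITE outlet temperature stays in the target range."""
    return all(low <= t <= high for t in outlet_temps_c)

state = make_state(0.7, 30.5)
ok = in_range([27.8, 28.4])
```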
III-A2 Cooling System Model: Action Space
In the target simulation model, the two zones are equipped with different cooling systems: a direct expansion (DX) cooling system for one and a chilled water (chiller) cooling system for the other. Both cooling systems are supplied with cool water from a cooling tower, but they use the cool water in different ways. In the DX system, the cool water passes through coils and cools down the airflow passing over them. In the chiller system, the cool water first refrigerates another water stream (chilled water), which in turn cools down the airflow supplied to the DC.
The main components of these two cooling systems are shown in Fig. 2. In the DX cooling system, the intake ambient airflow is first cooled by two types of evaporative coolers, direct (DEC) and indirect (IEC), then passes over the DX cooling coils and is fed to the DC.
The underlying control algorithm in EnergyPlus (referred to as the DefaultE+ control algorithm) uses the following five setpoints to control the cooling system: the DEC outlet (air) temperature, the IEC outlet (air) temperature, the chilled water loop outlet (water) temperature, the DX cooling coil outlet (air) temperature, and the chiller cooling air loop outlet (air) temperature.
The DefaultE+ algorithm relies on a fixed zone temperature setpoint and then computes the settings of the five setpoints from knowledge of the underlying system dynamics together with the load and weather information. In our proposed learning algorithm, the same five setpoints are used as control variables, but their values are instead learned from the pre-collected data trace. Neither the physical meaning of these variables nor the relationships among them is used in training.
III-B Problem Statement
We formulate the cooling control optimization problem as follows. We are given a time-varying tuple of the ambient air temperature and the load factor. The problem is to determine the values of the five control setpoints to minimize the objective function stated in Eq. (1):

    C = PUE + Σ_{i∈{1,2}} λ · ln(1 + exp(T_i − T_th))    (1)

    s.t. each setpoint lies within its valid operating range,

where λ, T_i, and T_th denote the penalty pricing factor, the average ITE outlet temperature of zone i, and the overheating threshold, respectively. The objective function strikes a balance between minimizing the PUE and preventing overheating in the server zones: the first part is the PUE, which is to be minimized; the second part accounts for the penalty of overheating in each of the two zones. The penalty term takes the standard form of the softplus activation function, which is implemented in most deep learning frameworks and is thus easy to use. During training, we minimize this cost function (the negative of the reward), since training optimization algorithms are commonly designed for minimization.
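As a sketch, the cost of Eq. (1) can be computed as follows (a minimal Python version; the softplus form comes from the text, while the default penalty factor and threshold are the values used later in the simulation study):

```python
import numpy as np

# Cost of Eq. (1): PUE plus a softplus penalty on each zone's average
# ITE outlet temperature exceeding the overheating threshold.
def softplus(x):
    return np.log1p(np.exp(x))

def cooling_cost(pue, zone_temps, penalty_factor=0.01, threshold=29.0):
    penalty = sum(penalty_factor * softplus(t - threshold) for t in zone_temps)
    return pue + penalty

# With both zones well below the threshold, the cost is essentially the PUE.
cost = cooling_cost(1.33, [27.0, 28.0])
```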
IV. The Proposed Approach: Neural End-to-End Cooling Control Algorithm
In this section, we present the end-to-end cooling control algorithm (CCA), adapted from DDPG, which combines key RL techniques such as the deep Q-network (DQN), the deterministic policy gradient (DPG), and the actor-critic architecture. In the following, we first provide an overview of the related RL concepts and techniques; we then describe the complete algorithm flow and the design of the neural networks.
IV-A Overview of Q-Learning and Policy Gradient
For our application, the goal is to enable an AI agent to learn an optimal cooling control policy from a data set that records a sequence of states, actions taken, and rewards at discrete time steps. Within the RL framework, this goal is achieved by either value-based or policy-based approaches. Central to the value-based approaches is the Q-learning technique. Although for discrete state and action spaces (especially small ones) the Q-function can be represented as a table computed by iterative Bellman-equation updates, in practice it is often estimated by a function approximator such as a neural network, as in the deep Q-network (DQN) [12]. Among policy-based approaches, the policy gradient (PG) is an important algorithm that optimizes a policy end-to-end by computing noisy estimates of the gradient of the expected reward and updating the policy in the gradient direction. When the state or action space is continuous, a naive adaptation of DQN or PG via discretization often results in intractability or very slow learning convergence (or even divergence). We therefore use the DDPG algorithm, which is essentially a hybrid method combining the policy gradient method and the value function [13].

IV-B Online Learning vs. Batch Learning and Off-Policy vs. On-Policy
We note that RL algorithms can be used directly as online learning algorithms: the control algorithm learns in an online manner, starting from an initial state and adjusting itself with the input received from the ongoing process, either real operation or simulation. However, this is problematic for the DC cooling task, which cannot risk erroneous settings. In this work, we therefore focus on control algorithms that are first pre-trained on offline data, which is referred to as "batch learning". Batch algorithms can be further divided into two categories based on how the training data are generated: off-policy and on-policy. Off-policy algorithms employ a separate behavior policy, independent of the policy being estimated, to generate the training trace, while on-policy algorithms directly use the policy being estimated (in real control practice or, more likely, in a simulator) to generate training traces. For DC simulation, the time cost is high; thus off-policy algorithms are easier to apply and more suitable for our situation.
In summary, we propose an off-policy control algorithm adapted from the canonical DDPG. The algorithm employs only a single offline trace for batch learning. In the following, we introduce the details of the proposed algorithm.
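To make the off-policy actor-critic idea concrete, the following minimal numpy sketch replaces the deep networks with linear approximators and performs one critic and one actor update on a pre-collected sample (all names, dimensions, and values are illustrative; states are assumed normalized as in Section IV-D):

```python
import numpy as np

# Linear stand-ins for DDPG's critic Q(s, a) and actor mu(s).
rng = np.random.default_rng(0)
state_dim, action_dim = 2, 5               # (load, ambient temp) -> 5 setpoints
w_q = rng.normal(size=state_dim + action_dim) * 0.1    # critic weights
W_mu = rng.normal(size=(action_dim, state_dim)) * 0.1  # actor weights

def q_value(s, a):
    """Critic: predicted cost of taking action a in state s."""
    return w_q @ np.concatenate([s, a])

def policy(s):
    """Actor: proposed action (setpoint vector) for state s."""
    return W_mu @ s

# One off-policy update from a pre-collected (state, action, cost) sample.
s = np.array([0.7, 0.5])                   # normalized state
a = rng.uniform(-1.0, 1.0, action_dim)     # action from the behavior policy
cost, lr = 1.35, 0.01

# Critic step: reduce squared error between Q(s, a) and the observed cost.
err = q_value(s, a) - cost
w_q -= lr * err * np.concatenate([s, a])

# Actor step: move mu(s) to decrease the critic's predicted cost.
grad_a = w_q[state_dim:]                   # dQ/da for the linear critic
W_mu -= lr * np.outer(grad_a, s)           # chain rule through mu(s) = W_mu @ s
```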
IV-C Cooling Control Algorithm with Offline Trace (CCA)
The flowchart of the proposed Cooling Control Algorithm (CCA) with an offline trace is shown in Algorithm 1. For the training task, a data trace is collected (line 1), which contains, for each time step, the state (workload level and ambient temperature), the action (the five setpoints), and the reward computed by objective function (1) from the observed PUE and temperature data. Unlike the canonical RL approach, in which future rewards are also included in the evaluation of an action (the discounted return), here future reward information is not used, since the system transitions are driven by the workload and weather trends. Note that because an action takes some time to produce its effect, we shift the state observation back by one time slot, so that the action computed from the current state takes effect in the next time slot. All these data are prepared as time series and further divided into training data and validation data.
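The trace preparation above, including the one-slot shift between an action and its observed effect, can be sketched as follows (random placeholder data; the 80/20 split ratio is an assumption for illustration):

```python
import numpy as np

# Placeholder trace: one row per time step.
T = 1000
states = np.random.rand(T, 2)    # (workload level, ambient temperature)
actions = np.random.rand(T, 5)   # the five setpoints
costs = np.random.rand(T)        # computed from objective function (1)

# Align so the action chosen from state s_t is evaluated against the
# outcome observed one slot later, at t+1.
x_state, x_action, y_cost = states[:-1], actions[:-1], costs[1:]

# Split the aligned series into training and validation parts.
split = int(0.8 * (T - 1))
train = (x_state[:split], x_action[:split], y_cost[:split])
valid = (x_state[split:], x_action[split:], y_cost[split:])
```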
Before the training starts, we first initialize two neural networks (line 2). The critic network approximates the Q-value of a state-action pair: it takes the current state and the next action to take (combined into one vector) as input and outputs a scalar representing the cost of taking that action in that state. In this paper, we also consider recurrent decision making, in which a short recent history of states and actions is incorporated into the network input, i.e., the recent state-action pairs are concatenated with the current state into one input vector. Recurrent decision making can be helpful when the data are noisy, as will be shown in Section VI. We also propose a special design of the critic network that eases training: its second-to-last layer outputs the predicted PUE and temperature data, from which the cost is computed according to (1) in the last layer. With this design, we can directly inspect the predicted PUE and temperature information, which helps reveal the quality of the network. The actor network is a policy network: it takes the recent state-action history and the current state and outputs the next action to take.

The training procedure is illustrated in lines 3-21. We use a standard neural network training procedure with multiple training epochs. Within each epoch (lines 5-9), each batch of training data (in random order) is used to update the weights of the neural networks via gradient descent. The critic network is updated by minimizing the mean square error between the output of its second-to-last layer and the raw reward data, while the policy network is updated by minimizing the critic's output when the action at the current state is taken according to the policy network's output. To avoid overfitting, we also compute the validation error to keep track of the best weight parameters for each of the two networks, as shown in lines 11-20. One important note: the validation error of the policy network can be small at the beginning simply because the critic network is not yet well learned. For safety, we periodically re-initialize the policy network to address this problem.

IV-D Neural Network Design
The setup of the proposed critic and actor networks is shown in Fig. 3. The critic network has three hidden layers (two with nonlinear activation and one with linear output) and outputs the negative reward. To reduce the learning difficulty, its second-to-last layer outputs the predicted energy and temperature data, concatenated into one vector, which is then used to compute the cost according to (1). In training, the critic network is trained by minimizing the error between the predicted and the real data. The actor network has two hidden layers (one with linear activation and one with a nonlinear activation function) and outputs the next control action; it is optimized to reduce the loss function computed by the critic network. We found that a variety of neural network architectures achieve similar results with the necessary hyperparameter tuning; in Section V-D we show experimental results comparing different architectures. Note that to fit the range of the activation functions used in the neural networks, we normalize all data entries into the range (-1, 1) and de-normalize the network outputs when the real energy and temperature values are needed.

V. Simulation-Based Numerical Evaluation and Analysis
In this section, we present numerical evaluation results of the proposed CCA based on simulation. Simulation on the EnergyPlus platform is carried out to collect the training data and evaluate the proposed algorithms. Two baseline algorithms are compared with our proposed solution: one is the default control algorithm DefaultE+ from EnergyPlus, which computes the setpoints according to a target zone temperature using the underlying model; the other is a general TS control optimization algorithm adapted from [9], trained with the same data as the proposed approach.
V-A Simulation Configurations
We use EnergyPlus to collect the training data and assess different control algorithms for the following reasons. First, it is impossible to directly test control algorithms on a real DC due to the potential risk and the long running time. Second, EnergyPlus, whose development is an initiative of the U.S. Department of Energy Building Technologies Office, is a widely recognized and reliable simulation platform to model building cooling energy consumption. Third, EnergyPlus provides the flexibility that allows simulation with userdefined algorithms, control actions, and schedules.
The simulation is configured as follows. We adopt the original DC model provided by the EnergyPlus platform to keep this simulation-based study tractable, as described in Section III-A. We choose Singapore as the location and select the corresponding weather file to revise the simulation configuration file accordingly. We use a CPU load trace collected from the monitoring system of Wikimedia as the workload trace for the DC model. The whole simulation period is one year, and simulation data are collected every 6 minutes.
We use a random control algorithm to generate a one-year simulation trace to train the proposed algorithm. The control variables are randomly selected from valid ranges (obtained from a trace generated by the DefaultE+ algorithm) and then smoothed so that consecutive actions change gradually. Of the whole one-year simulation period, we select the last 45% as the test period.
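The exploration-trace generation above can be sketched as follows (the setpoint ranges and the moving-average window are our own placeholder values; the paper obtains the valid ranges from a DefaultE+ trace):

```python
import numpy as np

# One year at 6-minute steps: 365 * 24 * 10 time slots.
rng = np.random.default_rng(42)
T, n_setpoints = 87600, 5
lo = np.array([10.0, 10.0, 6.0, 12.0, 12.0])   # hypothetical valid ranges
hi = np.array([20.0, 20.0, 12.0, 18.0, 18.0])

# Draw each setpoint uniformly from its valid range ...
raw = rng.uniform(lo, hi, size=(T, n_setpoints))

# ... then smooth with a moving average so actions fluctuate smoothly.
window = 10
kernel = np.ones(window) / window
smooth = np.vstack([np.convolve(raw[:, j], kernel, mode="same")
                    for j in range(n_setpoints)]).T
```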
V-B Algorithm Configurations
For the proposed CCA, the hidden layer sizes of the critic network are set to 50, 50, and 3, and those of the actor network to 50 and 50. The weight parameters are updated with the Adadelta optimizer [43]. The maximum number of training epochs is 200, and the training batch size is 128. The penalty factor in (1) is set to 0.01; since it directly controls the trade-off between energy cost and cooling effect, this parameter must be tuned manually. More on its settings is shown in Section IV-E. The temperature threshold in (1) is set to 29; its value depends on the target temperature one wants to achieve.
For the TS optimization algorithm, we adopt the approach from [9] with the following changes; otherwise, it would not be comparable to our approach. In the first stage, we train the same evaluation network as in CCA to replace the original neural network designed in [9] for modeling chiller efficiency. In the second stage, an iterative differential evolution (DE) optimization algorithm provided by SciPy [44] is used to find the optimal solution for each test state. Ideally, the TS algorithm can perform no worse than the CCA algorithm if the optimization algorithm itself is tailored to this problem. Since we focus on the design of CCA, a general optimization algorithm is used for the TS approach, as in [9]; this is reasonable, as designing a specialized optimization algorithm for a real case is not an easy task.
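The TS baseline's second stage can be sketched as follows. The paper uses SciPy's differential evolution against the trained evaluation network; here, a plain random search and a stand-in quadratic cost model are substituted so the sketch stays self-contained (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def learned_cost(action, state):
    """Stand-in for the first-stage evaluation network: cost is minimized
    at a state-dependent 'ideal' setpoint vector."""
    ideal = 0.5 - 0.4 * state.mean()
    return float(np.sum((action - ideal) ** 2))

def second_stage(state, n_candidates=2000, dim=5):
    """Search the normalized setpoint space for the lowest predicted cost."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, dim))
    scores = [learned_cost(a, state) for a in candidates]
    return candidates[int(np.argmin(scores))]

state = np.array([0.7, 0.3])
best = second_stage(state)
```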
For the proposed CCA and the TS optimization algorithms, the optimal control settings generated by these two algorithms are tested by simulation on the EnergyPlus platform. That is, for each state at the testing phase, we use the settings provided by the CCA or the TS algorithm and then record the resulting state changes and rewards for performance evaluation.
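The test protocol described above amounts to a simple closed loop; a schematic Python version follows (with toy stand-ins for the policy and the EnergyPlus step, since neither is reproduced here):

```python
# Evaluation loop: at each test step the policy proposes setpoints, the
# simulator applies them, and the resulting state and reward are recorded.
def evaluate(policy, simulator_step, init_state, n_steps):
    state, log = init_state, []
    for _ in range(n_steps):
        action = policy(state)
        state, reward = simulator_step(state, action)
        log.append((state, action, reward))
    return log

# Toy stand-ins so the protocol can be exercised:
toy_policy = lambda s: [0.5] * 5          # fixed setpoint vector
toy_sim = lambda s, a: (s, 1.0)           # state unchanged, constant reward
trace = evaluate(toy_policy, toy_sim, (0.7, 0.3), 10)
```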
V-C Comparing CCA to Baseline Algorithms
In this section, we present the simulation results of the average PUE and maximum outlet temperature (during the test period), obtained by using DefaultE+, TS, and CCA algorithms. Based on these results we further compare and evaluate the underlying control algorithms.
| Metric | DefaultE+ | TS | CCA | TS (recurrent) | CCA (recurrent) |
| PUE | 1.376 ± 0.000 | 1.371 ± 0.000 | 1.333 ± 0.003 | 1.346 ± 0.023 | 1.307 ± 0.017 |
| Max outlet temperature, zone 1 | 28.617 ± 0.000 | 26.242 ± 0.000 | 28.910 ± 0.361 | 30.995 ± 3.121 | 30.111 ± 2.759 |
| Max outlet temperature, zone 2 | 28.655 ± 0.000 | 26.288 ± 0.000 | 29.124 ± 0.353 | 30.291 ± 3.307 | 33.250 ± 0.005 |
Table I shows the first- and second-order statistics of the PUE and the maximum outlet temperature over 10 independent runs, with different settings for TS and CCA. To better examine these results, we also plot the data distributions in Fig. 4. We can observe the following:

The proposed (non-recurrent) CCA algorithm achieves the best control results, reducing the PUE from 1.37 to 1.33, an 11% saving in cooling power, while maintaining the temperature of both zones under or near the predefined threshold of 29 degrees. This shows that the actor network can indeed attain optimal or close-to-optimal control settings.

The TS algorithm with the general optimization algorithm shows unstable performance; to improve it, a specialized optimization procedure would be necessary. Compared with the TS approach, the proposed CCA is an end-to-end solution: CCA directly outputs the control setting with the pre-trained policy network, which can be carefully tuned and tested offline, whereas TS requires an online optimization algorithm, which incurs higher computation cost and accuracy problems in real time.

The recurrent version of CCA yields unstable results here. This is reasonable: recurrent decision making is beneficial when the data are noisy, but since a simulated case is studied here, the generated data are noise-free. In Section VI we show how recurrent decision making can be useful on real data collected from a physical DC.
Fig. 5 shows an example of the PUE and temperature traces obtained from our simulation during the test period. The PUE curve of CCA is lower than those of DefaultE+ and TS, while its temperature curves in both zones are higher but still satisfactory. Note that at the beginning of the test period, the zone temperature drops quickly for TS, due to the transition from the DefaultE+ algorithm (which provides the settings before the test period) to the learning algorithm.
V-D Neural Network Design Study
In this subsection, we study different designs of the neural network and compare their performance. We compare our network design with three other implementations: 1) TargetNet: with the target network of DDPG, which is used to avoid fast changes of the network and stabilize training; 2) ReluNet: a four-layer ReLU-based network (1024-512-256-3), similar to [45], to test whether ReLU activation works; 3) LstmNet: with an LSTM layer to process the recent history trace. As we use the recent states and actions of the last steps as the input to the network, we can first process them with an LSTM layer, since LSTM is a recurrent neural network suited to sequential data. The LSTM layer outputs its hidden units, which are then fed into a normal
network as described in Section IV-D. We test these designs with the history length set to 1; for LstmNet, we show the results with the history length set to 3. Test results are shown in Table II. Compared with the original CCA, these architectures achieve very similar results. With TargetNet, the results are almost the same as the original design, which is reasonable: we use a long offline trace to train the network in each training episode, leading to a very stable learning process. With ReLU layers, we obtain similar results, with a slightly higher temperature reading and lower PUE; with the LSTM layer added, we obtain a slightly lower temperature but higher PUE. Given sufficient hyper-parameter tuning, these designs may also achieve satisfactory performance. In this simulation case, since larger ReLU layers and LSTM layers slow down training, we adopt the simpler design illustrated in Fig. 3.
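As an illustration only, a numpy sketch of the ReluNet variant's forward pass (hidden sizes 1024-512-256 with a 3-unit output; the 32-dimensional state is an assumed placeholder) using the small-standard-deviation weight initialization discussed in the hyper-parameter study:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_layer(n_in, n_out, rng, std=0.01):
    # Zero-mean Gaussian init with a small standard deviation,
    # which tends to give a more stable training process.
    return rng.normal(0.0, std, size=(n_in, n_out)), np.zeros(n_out)

def relunet_forward(state, params):
    h = state
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:   # ReLU on hidden layers, linear output
            h = relu(h)
    return h

rng = np.random.default_rng(0)
sizes = [32, 1024, 512, 256, 3]   # the 32-dim state is an assumed placeholder
params = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
out = relunet_forward(rng.normal(size=(1, 32)), params)
print(out.shape)  # (1, 3)
```

This is only a sketch of the architecture's shape, not the training code.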
                          CCA                    TargetNet              ReluNet                LstmNet
PUE                       1.333e+00 ± 3.371e-03  1.338e+00 ± 1.301e-03  1.326e+00 ± 2.128e-07  1.343e+00 ± 2.305e-02
Max outlet temp (zone 1)  2.891e+01 ± 3.614e-01  2.850e+01 ± 1.281e-01  2.949e+01 ± 7.958e-14  2.874e+01 ± 2.186e+00
Max outlet temp (zone 2)  2.912e+01 ± 3.525e-01  2.867e+01 ± 1.208e-01  2.951e+01 ± 1.709e-09  2.864e+01 ± 2.640e+00
V-E Hyper-Parameter Setting Study
Besides the network architecture studied above, the hyper-parameter setting is also critical when applying a DRL algorithm to practical applications. In this subsection, we discuss how we set the hyper-parameters used in our algorithm, in the hope that it sheds some light on similar applications. A key hyper-parameter is the learning rate. We choose Adadelta as the training optimization algorithm, which does not require setting the learning rate manually. Adadelta works well in our case, but it may not on other problems; when the learning rate must be set manually, we recommend starting from a small value such as 1e-5. Another key hyper-parameter is the initialization range of the neural network weights. We generate the random weights with zero mean and a standard deviation of 0.01; a smaller initialization range tends to yield a more stable training process.
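The Adadelta update rule [43] that lets us avoid choosing a learning rate can be sketched in a few lines; the quadratic objective below is only a toy example, not our training loss:

```python
import numpy as np

def adadelta_step(x, grad, eg2, edx2, rho=0.95, eps=1e-6):
    # Keep a decaying average of squared gradients, then scale the step
    # by the ratio of past-update RMS to gradient RMS (no learning rate).
    eg2 = rho * eg2 + (1 - rho) * grad ** 2
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad
    edx2 = rho * edx2 + (1 - rho) * dx ** 2
    return x + dx, eg2, edx2

# Toy example: minimize f(x) = x^2 without picking a learning rate.
x, eg2, edx2 = 1.0, 0.0, 0.0
for _ in range(2000):
    x, eg2, edx2 = adadelta_step(x, 2 * x, eg2, edx2)
print(abs(x) < 1.0)  # the iterate has moved toward the minimum
```

Note the characteristic slow start: the first steps are tiny because the accumulated update RMS begins at zero, which is part of why the method is robust to initialization.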
A critical hyper-parameter related to the loss function in CCA is the penalty factor in (1). In our experiments, it needs to be tuned manually to achieve a satisfactory tradeoff between energy and temperature. Fig. 6 shows how the PUE and the maximum outlet temperature of each zone change as the penalty factor varies from 0.0 to 0.04. Note that the proper setting will differ from problem to problem; for example, in the next section we show a different test case in which the best setting differs from this simulation case.
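As a hypothetical illustration of this tradeoff (the exact form of the loss in (1) may differ), a penalized objective of the form "energy term plus weighted temperature violation" behaves as follows:

```python
def control_loss(pue, zone_temps, t_threshold, penalty):
    # Hypothetical loss: energy term plus a weighted penalty on
    # temperature excursions above the threshold.
    violation = sum(max(t - t_threshold, 0.0) ** 2 for t in zone_temps)
    return pue + penalty * violation

# With penalty = 0 the optimizer ignores overheating entirely;
# a larger penalty trades energy saving for temperature headroom.
print(control_loss(1.33, [29.5, 28.8], 29.0, 0.0))   # energy term only
print(control_loss(1.33, [29.5, 28.8], 29.0, 0.04))  # 1.33 + 0.04 * 0.25
```

Sweeping the penalty, as in Fig. 6, traces out the achievable PUE/temperature frontier for a given problem.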
VI Tests on Real Data Trace from NSCC
To further investigate the proposed algorithm, we test CCA on a data trace collected from the National Supercomputing Centre (NSCC) of Singapore and evaluate its performance in optimizing the energy cost while satisfying the cooling requirements (rack intake temperature).
We focus on optimizing the air cooling system for the computing nodes in NSCC. The 3D model of the research target is illustrated in Fig. 7. There are 26 racks in the target system, served by three precision cooling units (PCUs) that supply cold air at about 20 degrees. Cold air enters the cold aisle, passes through the racks, and finally returns to the PCUs. Other cooling facilities are also installed: racks 1-20 use an additional warm water cooling system to cool the CPU/GPU and memory chips, and racks 21-26 use an additional rear-door cooling system. Neither the warm water cooling system nor the rear-door cooling system is studied here, so we omit further details.
We optimize the total supply flow rate of the target PCUs shown in Fig. 7, aiming to minimize their power consumption while keeping the average intake temperature of the racks (measured at the 36U height of each rack) under a predefined threshold.
To apply the proposed algorithm, we collect the data related to the optimization goal, as shown in Table III. Note that several measurements of the warm water cooling system and the rear-door cooling system are also used; our experiments show that including these readings increases the approximation accuracy of the network. We collected these data entries every 3 minutes from March 1 to 15, 2017, using the first 85% as training data and the last 15% as test data. With these data, we train the networks with the proposed algorithm. In this case, as the power consumption can be computed directly from the airflow rate by the fan law, we only need the network that approximates the inlet temperature, which we use as the thermal indicator.
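The fan-law relation mentioned above, with power scaling as the cube of the volumetric flow rate, is simple enough to state directly:

```python
def fan_power_ratio(flow, flow_ref):
    # Fan affinity law: power scales with the cube of the
    # volumetric flow rate, P / P_ref = (Q / Q_ref) ** 3.
    return (flow / flow_ref) ** 3

# Reducing the PCU supply flow by 10% cuts fan power by about 27%,
# which is why the flow rate is such an effective control knob.
saving = 1.0 - fan_power_ratio(0.9, 1.0)
print(round(saving, 3))  # 0.271
```

This cubic relationship is why even modest flow-rate reductions, if the thermal constraint permits them, translate into substantial cooling power savings.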
Experimental results demonstrate that the proposed algorithm works as expected. First, the normalized mean absolute error (MAE) of the network is smaller than 0.1 degrees, as shown in Fig. 8, which indicates that the network successfully captures the system dynamics. Fig. 8 also shows that recurrent decision making performs better when dealing with noisy real data, as we obtain the best temperature estimation with a recurrent setting. However, even though an MAE below 0.1 degrees seems impressive, the network can still underestimate the temperature, because the regression fits mostly to the bulk of the data distribution; this can lead to over-optimistic energy saving estimates. As we cannot directly apply the CCA algorithm to a real DC, it is important to first generate convincing theoretical results that are free of underestimation. To this end, we change the validation strategy in Algorithm 1 and apply a special de-underestimation (DUE) validation method, which works as follows: in line 11 of CCA, when computing the validation error, we replace the original squared error with a function that only considers the underestimation error, as shown below:
e_{\mathrm{DUE}} = \big(\max(y - \hat{y},\, 0)\big)^2
where $y$ is the measured temperature and $\hat{y}$ is the network prediction, so over-estimation incurs no validation error.
With the DUE validation method, we can reduce the underestimation cases, as shown in Fig. 9.
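A minimal numpy sketch of this underestimation-only validation error (the function name is ours, not the paper's):

```python
import numpy as np

def due_validation_error(t_actual, t_predicted):
    # Only under-estimation (predicted temperature below the actual
    # reading) contributes to the validation error; over-estimation is
    # ignored, so the validated model errs on the safe (cooler) side.
    under = np.maximum(np.asarray(t_actual) - np.asarray(t_predicted), 0.0)
    return float(np.mean(under ** 2))

# One prediction 1 degree low, one 1 degree high: only the low one counts.
print(due_validation_error([26.0, 26.0], [25.0, 27.0]))  # 0.5
```

Model selection under this criterion favors networks that over-predict slightly, which makes the projected energy savings conservative rather than optimistic.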
Second, we study how much energy saving can be achieved with different values of the penalty parameter (used in the objective function (1)), as shown in Fig. 10. Fig. 11 also shows how the control actions and predicted temperature change with the penalty parameter. From Fig. 10 we see that with DUE, about 15% energy saving is easily achieved if the temperature threshold is set above 26.6 degrees. For a lower maximum temperature such as 26.4 degrees, we can still save about 10%. If the target maximum temperature is set below 26.2 degrees, more cooling power would be needed than under the currently adopted setting.
State
Action  F4: the airflow supply rate of the target PCUs
Reward
VII Conclusion
DCs power modern society as the infrastructure for information storage, processing, and dissemination. At the same time, a DC consumes a formidable amount of electricity, a significant portion of which goes to cooling. Developing an optimal control policy for the sophisticated cooling system of a DC is a challenging task. We propose and verify an end-to-end DRL approach, CCA, for the control optimization of the cooling system of a DC. Compared with the existing TS optimization method, CCA directly optimizes a policy network based on the observed historical data, and the policy outputs the optimized control settings for any given state. Adapted from DDPG and the actor-critic framework, our algorithm is a batch off-policy algorithm, with which we tested various algorithm settings to examine performance thoroughly.
We evaluate the proposed algorithm on the EnergyPlus simulation platform and on a trace collected from a real DC. The simulation results show that our method can maintain the DC temperature within the predefined threshold while achieving a lower PUE, saving about 11% cooling energy compared with a baseline approach using manually designed control settings. On the real trace, we achieve high evaluation accuracy, and the predicted control settings can reduce the cooling cost by around 15% while keeping the rack intake temperature under a predefined threshold. These results show that our algorithm can successfully learn the system dynamics from monitoring data and contribute to improving cooling efficiency.
References
 [1] “Cisco Global Cloud Index: Forecast and Methodology, 20142019,” http://www.cisco.com/c/en/us/solutions/collateral/serviceprovider/globalcloudindexgci/Cloud_Index_White_Paper.html, accessed: 20150701.
 [2] “America’s data centers consuming and wasting growing amounts of energy,” http://www.nrdc.org/energy/datacenterefficiencyassessment.asp, accessed: 20150701.
 [3] J. Ni and X. Bai, “A review of air conditioning energy performance in data centers,” Renewable and sustainable energy reviews, vol. 67, pp. 625–640, 2017.
 [4] J. Sun and A. Reddy, “Optimal control of building HVAC systems using complete simulationbased sequential quadratic programming (CSBSQP),” Building and environment, vol. 40, no. 5, pp. 657–669, 2005.
 [5] B. Ahn and J. Mitchell, “Optimal control development for chilled water plants using a quadratic representation,” Energy and Buildings, vol. 33, no. 4, pp. 371–378, 2001.
 [6] L. Lu, W. Cai, L. Xie, S. Li, and Y. C. Soh, “HVAC system optimization—inbuilding section,” Energy and Buildings, vol. 37, no. 1, pp. 11–22, 2005.
 [7] Y.C. Chang, “A novel energy conservation method—optimal chiller loading,” Electric Power Systems Research, vol. 69, no. 2, pp. 221–226, 2004.
 [8] Z. Ma and S. Wang, “An optimal control strategy for complex building central chilled water systems for practical and realtime applications,” Building and Environment, vol. 44, no. 6, pp. 1188–1198, 2009.

 [9] T. Chow, G. Zhang, Z. Lin, and C. Song, “Global optimization of absorption chiller system by genetic algorithm and neural network,” Energy and buildings, vol. 34, no. 1, pp. 103–109, 2002.
 [10] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [11] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE transactions on systems, man, and cybernetics, no. 5, pp. 834–846, 1983.
 [12] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
 [13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [14] “DeepMind AI reduces Google data centre cooling bill by 40%,” https://deepmind.com/blog/deepmindaireducesgoogledatacentrecoolingbill40/, accessed: 20170301.
 [15] M. Dayarathna, Y. Wen, and R. Fan, “Data center energy consumption modeling: A survey,” IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 732–794, 2016.
 [16] W. Zhang, Y. Wen, Y. W. Wong, K. C. Toh, and C.H. Chen, “Towards joint optimization over ”ict” and cooling systems in data centre: A survey,” IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp. 1596–1616, 2016.
 [17] W. Xia, Y. Wen, K.C. Toh, and Y.W. Wong, “Toward green data centers as an interruptible load for grid stabilization in Singapore,” IEEE Communications Magazine, vol. 53, no. 11, pp. 192–198, 2015.
 [18] Y. Li, H. Hu, Y. Wen, and J. Zhang, “Learningbased power prediction for data centre operations via deep neural networks,” in Proceedings of the 5th International Workshop on Energy Efficient Data Centres, no. 6. ACM, 2016.
 [19] J. Yin, P. Sun, Y. Wen, H. Gong, M. Liu, X. Li, H. You, J. Gao, and C. Lin, “Cloud3dview: an interactive tool for cloud data center operations,” in ACM SIGCOMM Computer Communication Review, vol. 43, no. 4. ACM, 2013, pp. 499–500.
 [20] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, “A survey of actorcritic reinforcement learning: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, 2012.
 [21] D. B. Crawley, L. K. Lawrie, F. C. Winkelmann, W. F. Buhl, Y. J. Huang, C. O. Pedersen, R. K. Strand, R. J. Liesen, D. E. Fisher, M. J. Witte et al., “Energyplus: creating a newgeneration building energy simulation program,” Energy and buildings, vol. 33, no. 4, pp. 319–331, 2001.
 [22] K. Fouladi, A. P. Wemhoff, L. SilvaLlanca, K. Abbasi, and A. Ortega, “Optimization of data center cooling efficiency using reduced order flow modeling within a flow network modeling approach,” Applied Thermal Engineering, vol. 124, pp. 929–939, 2017.
 [23] U. Singh, H. G. Bhagwat, A. Varghese, A. K. Singh, R. Jayaprakash, A. Sivasubramaniam et al., “System and method for facilitating optimization of cooling efficiency of a data center,” Jun. 5 2018, uS Patent 9,990,013.
 [24] J. E. Braun, “Reducing energy costs and peak electrical demand through optimal control of building thermal storage,” ASHRAE transactions, vol. 96, no. 2, pp. 876–888, 1990.
 [25] C.C. Lo, S.H. Tsai, and B.S. Lin, “Ice storage airconditioning system simulation with dynamic electricity pricing: A demand response study,” Energies, vol. 9, no. 2, p. 113, 2016.
 [26] E. Pakbaznia and M. Pedram, “Minimizing data center cooling and server power costs,” in Proceedings of the 2009 ACM/IEEE international symposium on Low power electronics and design. ACM, 2009, pp. 145–150.
 [27] L. Li et al., “Coordinating liquid and free air cooling with workload allocation for data center power minimization,” in 11th International Conference on Autonomic Computing (ICAC 14). USENIX Association, 2014.
 [28] A. Banerjee, T. Mukherjee, G. Varsamopoulos, and S. K. Gupta, “Integrating cooling awareness with thermal aware workload placement for HPC data centers,” Sustainable Computing: Informatics and Systems, vol. 1, no. 2, pp. 134–150, 2011.
 [29] K. Zheng, W. Zheng, L. Li, and X. Wang, “Powernets: Coordinating data center network with servers and cooling for power optimization,” IEEE Transactions on Network and Service Management, vol. 14, no. 3, pp. 661–675, 2017.
 [30] F. L. Lewis and D. Liu, Reinforcement learning and approximate dynamic programming for feedback control. John Wiley & Sons, 2013, vol. 17.
 [31] D. Liu and Q. Wei, “Policy iteration adaptive dynamic programming algorithm for discretetime nonlinear systems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 3, pp. 621–634, 2014.
 [32] B. Luo, H.N. Wu, and T. Huang, “Offpolicy reinforcement learning for control design,” IEEE transactions on cybernetics, vol. 45, no. 1, pp. 65–76, 2015.
 [33] D. Liu, X. Yang, D. Wang, and Q. Wei, “Reinforcementlearningbased robust controller design for continuoustime uncertain nonlinear systems subject to input constraints,” IEEE transactions on cybernetics, vol. 45, no. 7, pp. 1372–1385, 2015.
 [34] H. Modares, I. Ranatunga, F. L. Lewis, and D. O. Popa, “Optimized assistive human–robot interaction using reinforcement learning,” IEEE transactions on cybernetics, vol. 46, no. 3, pp. 655–667, 2016.
 [35] Y. Pan and H. Yu, “Biomimetic hybrid feedback feedforward neuralnetwork learning control,” IEEE transactions on neural networks and learning systems, vol. 28, no. 6, pp. 1481–1487, 2017.
 [36] R. Song, F. L. Lewis, and Q. Wei, “Offpolicy integral reinforcement learning method to solve nonlinear continuoustime multiplayer nonzerosum games,” IEEE transactions on neural networks and learning systems, vol. 28, no. 3, pp. 704–713, 2017.
 [37] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal representation and trading,” IEEE transactions on neural networks and learning systems, vol. 28, no. 3, pp. 653–664, 2017.
 [38] C. J. Watkins and P. Dayan, “Qlearning,” Machine learning, vol. 8, no. 34, pp. 279–292, 1992.
 [39] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Qlearning,” CoRR, abs/1509.06461, 2015.

 [40] J. Koutník, J. Schmidhuber, and F. Gomez, “Evolving deep unsupervised convolutional networks for vision-based reinforcement learning,” in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation. ACM, 2014, pp. 541–548.
 [41] K. Narasimhan, T. Kulkarni, and R. Barzilay, “Language understanding for text-based games using deep reinforcement learning,” arXiv preprint arXiv:1506.08941, 2015.
 [42] “Wikimedia Grid Report,” https://ganglia.wikimedia.org/latest/, accessed: 20150701.
 [43] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
 [44] E. Jones, T. Oliphant, P. Peterson et al., “Open source scientific tools for python,” 2001.

 [45] “Deep deterministic policy gradients in TensorFlow,” http://pemami4911.github.io/blog/2016/08/21/ddpgrl.html, accessed: 20170301.