Many real-world problems can be reduced to combinatorial optimization on a graph, where the subset or ordering of vertices that maximizes some objective function must be found. A solution to such a problem consists of a subset of vertices that satisfies the desired optimality criteria. However, as no known algorithms are able to solve NP-hard problems in polynomial time, exact methods rapidly become intractable for all but the simplest of tasks. Instead, heuristics are often deployed that, despite offering no theoretical guarantees, are chosen for high performance. Across numerous practical settings, ranging from fundamental science to industry, efficient methods for approaching combinatorial optimization are therefore of great importance.

In the Neural Combinatorial Optimization (NCO) framework, such a heuristic is parameterized by a neural network, allowing solutions to be obtained for many different combinatorial problems without hand-engineering. NCO was applied to such problems for the first time by Bello et al., who used policy gradients to train pointer networks [vinyals15], a recurrent architecture that produces a softmax attention mechanism (a "pointer") to select a member of the input sequence as an output. However, this architecture does not reflect the structure of problems defined over a graph, which Khalil et al. [khalil17] addressed with S2V-DQN, a reinforcement learning (RL) framework for graph-based combinatorial problems. S2V-DQN, and the related works discussed shortly, incrementally construct solutions one element at a time, reducing the problem to predicting the value of adding any vertex not currently in the solution to this subset. Mittal et al. [mittal19] developed these ideas further by modifying the training process: first training an embedding graph convolutional network (GCN), and then training a Q-network to predict the vertex (action) values. Recent surveys summarise the broader efforts of both the machine learning and operations research communities to leverage machine learning for combinatorial optimization.

However, due to the inherent complexity of many combinatorial problems, learning a policy that directly produces a single, optimal solution is often impractical, as evidenced by the sub-optimal performance of such approaches. We instead propose that the agent should explore the solution space at test time, seeking ever better solutions even if this requires short-term sacrifices in the value of the objective function. Our approach of exploratory combinatorial optimization (ECO-DQN) is, in principle, applicable to any combinatorial problem that can be defined on a graph; in this work we focus on the Max-Cut problem. Formally, for a graph G(V,W), with vertices V connected by edges W, the Max-Cut problem is to find the subset of vertices S⊂V that maximises C(S,G) = ∑_{i∈S, j∈V∖S} w_ij, where w_ij∈W is the weight of the edge connecting vertices i and j. Equivalently, letting x_k∈{±1} label whether vertex k∈V is in the solution subset S⊂V, the cut value can be written as C = ¼ ∑_{i,j} w_ij (1 − x_i x_j).
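To make the objective concrete, the following short Python sketch evaluates the cut value of a candidate solution in both of the forms given above. It is illustrative only; the function names and the dense-matrix representation are not taken from the released implementation.

```python
import numpy as np

def cut_value(W, x):
    """Cut value C for a +/-1 labelling x of the vertices.

    W is the symmetric weighted adjacency matrix (w_ii = 0). An edge (i, j)
    contributes w_ij to the cut iff x_i != x_j, which is exactly
    C = 1/4 * sum_ij w_ij * (1 - x_i * x_j).
    """
    return 0.25 * float(np.sum(W * (1.0 - np.outer(x, x))))

def cut_value_from_subset(W, S):
    """Equivalent subset form: C(S, G) = sum_{i in S, j not in S} w_ij."""
    n = W.shape[0]
    x = np.array([1 if k in S else -1 for k in range(n)])
    return cut_value(W, x)
```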
We frame exploratory combinatorial optimization as a Markov decision process (MDP) in which the agent repeatedly modifies a candidate solution: at each timestep a single vertex is "flipped", i.e. added to or removed from the solution subset S. The optimization is performed according to a policy, π, which maps a state to a probability distribution over actions. Our agents are trained to find solutions for any arbitrary graph from a given distribution, and we also show that ECO-DQN generalises well to graphs from unseen distributions.

The reward structure is designed so that exploration becomes an exercise in surpassing the best solution observed so far. Formally, the reward at state s_t is given by R(s_t) = max(C(s_t) − C(s*), 0)/|V|, where s* is the state corresponding to the highest cut value previously seen within the episode, C(s*) (note that we implicitly assume the graph, G, and solution subset, S, to be included in the state). Normalising by the number of vertices, |V|, keeps the reward scale comparable across graph sizes. As continued exploration is desired even after a good solution is found, there is no punishment if a chosen action reduces the cut value, and the best solution obtained at any point within an episode, rather than the final state, is taken as the final result.

As there are far more states than could be visited within our finite episodes, the vast majority of which are significantly sub-optimal, it is useful to focus the agent on a subset of states known to include the global optimum: the locally optimal states. We therefore also provide a small intermediate reward of 1/|V| whenever the agent reaches a locally optimal state (one where no action will immediately increase the cut value) previously unseen within the episode. For small graphs the agent performs near-optimally with or without this intrinsic motivation, however the difference becomes noticeable when generalising to larger graphs at test time. Also, we train our agents with discounted future rewards (γ=0.95), and although this is found to provide strong performance, the relatively short-term reward horizon likely limits exploration to local regions of the solution space. A compact sketch of the per-step reward is given below.
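The sketch follows the conventions just described: normalisation by |V|, no negative rewards, and an intrinsic bonus of 1/|V| for newly found locally optimal states. The function and argument names are illustrative.

```python
def step_reward(cut, best_cut, is_new_local_optimum, n_vertices):
    """Reward R(s_t) = max(C(s_t) - C(s*), 0) / |V|, plus an intrinsic
    bonus of 1/|V| for reaching a previously unseen locally optimal state.

    `best_cut` is C(s*), the highest cut value seen so far in the episode.
    Actions that reduce the cut value receive zero (not negative) reward.
    """
    reward = max(cut - best_cut, 0.0) / n_vertices
    if is_new_local_optimum:
        reward += 1.0 / n_vertices
    return reward
```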
The Q-value for flipping each vertex is calculated from seven observations derived from the current state, x_v∈R^7 for each vertex v. Observations (1-3) are local, which is to say they can be different for each vertex considered, whereas (4-7) are global, describing the overall state of the graph and the context of the episode. The first two observations are whether v is currently in the solution set, S, and the immediate change in cut value if the state of v is changed. The general purposes of the observations are: (1-2) provide useful information for determining the value of selecting an action; (3) provides a simple history to prevent short looping trajectories; (4-5) ensure that the rewards calculated with respect to the best observed cut value are Markovian; (6) allows the agent to predict when it will receive the intrinsic rewards previously discussed; and (7) accounts for the finite episode duration. An illustrative sketch of such a feature vector is given below, and the full definitions are provided in the Appendix.

To facilitate direct comparison, ECO-DQN and S2V-DQN are implemented with the same message passing neural network (MPNN) architecture, of which many graph-network models are specific implementations, with details again provided in the Appendix. The initial embedding of each vertex, v, is obtained from its input vector of observations, x_v∈R^m, via a learned embedding function parameterised by θ1∈R^{m×n}. The embeddings are then repeatedly updated with information from neighbouring vertices and, after K rounds of message passing, the Q-value for flipping each vertex is read out from the final embeddings.

For agents that are allowed to both add and remove vertices from the solution set (ECO-DQN and selected ablations), which for convenience we will refer to as reversible agents, the episode lengths are set to twice the number of vertices in the graph, t=1,2,…,2|V|, and each episode is initialised with a random subset of vertices in the solution set. Changing the initial subset of vertices selected to be in the solution set can result in very different trajectories over the course of an episode. Irreversible agents, by contrast, begin with an empty solution set and add one vertex at a time.
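The following sketch assembles one plausible 7-feature observation vector per vertex, consistent with the stated purposes of observations (1)-(7). The exact feature definitions and normalisations used by ECO-DQN are given in the Appendix; those chosen here (e.g. normalising by |V| and measuring distance to the best-seen state by Hamming distance) are assumptions for illustration.

```python
import numpy as np

def vertex_observations(W, x, best_x, best_cut, cut, last_flip_step, t, t_max):
    """Illustrative 7-dimensional observation x_v for every vertex.

    x and best_x are +/-1 labellings of the current and best-seen solutions,
    last_flip_step[v] is the step at which v was last flipped (or -1).
    Features (1)-(3) are local, (4)-(7) are global (broadcast to all vertices).
    """
    n = W.shape[0]
    delta = x * (W @ x)                       # cut change if each vertex is flipped
    obs = np.zeros((n, 7))
    obs[:, 0] = (x == 1)                      # (1) is v currently in the solution set S?
    obs[:, 1] = delta / n                     # (2) immediate cut change if v is flipped
    obs[:, 2] = (t - last_flip_step) / n      # (3) steps since v was last flipped
    obs[:, 3] = (cut - best_cut) / n          # (4) distance of current cut from best seen
    obs[:, 4] = np.mean(x != best_x)          # (5) distance of current state from best seen
    obs[:, 5] = np.mean(delta > 0)            # (6) fraction of actions that increase the cut
    obs[:, 6] = (t_max - t) / t_max           # (7) remaining episode duration
    return obs
```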
All graphs used for training and validation are generated with the NetworkX Python package [hagberg08]. We consider two graph distributions with edge weights w_ij∈{0,±1}: Erdős–Rényi (ER) graphs with a connection probability of 0.15, and Barabási–Albert (BA) graphs with an average degree of 4, with sizes ranging from |V|=20 to |V|=200. Agents are trained with a minibatch size of 64 and 32 actions per step of gradient descent.

Once trained, the agents are tested on a separate set of 100 held-out validation graphs from a given distribution. The approximation ratio of each method, averaged across the 100 graphs for each graph structure and size, is used as the metric of solution quality, with the quoted range corresponding to the upper and lower quartiles of performance across the validation graphs. The highest cut value found by any of the considered methods on a given graph is chosen as the reference point that we refer to as the "optimum value". A sketch of how the training and validation graphs can be generated is given below.
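This is a sketch of how graphs from the two training distributions could be generated with NetworkX [hagberg08]. The helper name and the seeding scheme are illustrative; using m=2 in the Barabási–Albert model is an assumption chosen to give an average degree of roughly 4.

```python
import random
import networkx as nx

def sample_graph(n_vertices, kind="ER", seed=None):
    """Sample one training/validation graph with edge weights in {+1, -1}.

    ER graphs use connection probability 0.15; BA graphs attach each new
    vertex to m=2 existing vertices, giving an average degree of ~4.
    """
    rng = random.Random(seed)
    if kind == "ER":
        g = nx.erdos_renyi_graph(n_vertices, p=0.15, seed=rng.randint(0, 2**31 - 1))
    elif kind == "BA":
        g = nx.barabasi_albert_graph(n_vertices, m=2, seed=rng.randint(0, 2**31 - 1))
    else:
        raise ValueError(f"unknown graph type: {kind}")
    for u, v in g.edges():
        g[u][v]["weight"] = rng.choice([+1, -1])
    return nx.to_numpy_array(g, weight="weight")  # weighted adjacency matrix W
```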
ECO-DQN is compared to multiple benchmarks on these validation sets. For every optimization episode of ECO-DQN or S2V-DQN, a corresponding MCA-rev or MCA-irrev episode is also undertaken and, unless stated otherwise, we run 50 randomly initialised episodes per graph for every reversible agent, taking the best solution found across these episodes as the final answer. However, as S2V-DQN is deterministic at test time, only a single optimization episode is used for every agent-graph pair.

As an additional benchmark we implement the MaxCutApprox (MCA) algorithm, a simple greedy heuristic that can be applied in either the reversible or the irreversible setting. The standard application, which we denote MCA-irrev, is irreversible and begins with an empty solution set: at each step the vertex whose addition most increases the cut value is added, until no further addition improves it. The reversible variant, MCA-rev, instead begins from a randomly chosen solution set and greedily flips the vertex that gives the greatest immediate increase in the cut value, until no such action exists. A minimal sketch of both variants is given below.
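In the sketch below, only the greedy rule and the two initialisation modes are taken from the text; tie-breaking and termination details are assumptions.

```python
import numpy as np

def mca(W, x0=None, reversible=True):
    """Greedy MaxCutApprox (MCA) baseline.

    MCA-irrev: start from an empty solution set (all vertices labelled -1)
    and greedily add the vertex that most increases the cut.
    MCA-rev: start from a (random) labelling x0 and greedily flip the vertex
    that most increases the cut. Both stop when no allowed move improves it.
    """
    n = W.shape[0]
    x = -np.ones(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    while True:
        delta = x * (W @ x)                               # cut change per flip
        if not reversible:
            delta = np.where(x == -1, delta, -np.inf)     # only allow additions to S
        v = int(np.argmax(delta))
        if delta[v] <= 0:                                 # no allowed improving move
            return x
        x[v] = -x[v]
```

In practice MCA-rev is run from many random initial labellings, keeping the best cut found, mirroring the multi-episode protocol used for the reversible RL agents.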
Single-episode and multi-episode performance on ER and BA graphs is summarised in table 1, and there are three important observations to emphasise. Firstly, ECO-DQN significantly outperforms the other approaches, even when restricted to use only a single episode per graph. Secondly, for the reversible agents it is clear that using multiple randomly initialised episodes provides a significant advantage; as ECO-DQN provides near-optimal solutions on small graphs within a single episode, it is only on larger graphs that this becomes relevant. Finally, because ECO-DQN can start from any arbitrary configuration, it can also be initialised with solutions found by other optimization methods to further strengthen them.

To interpret the performance gap between ECO-DQN and S2V-DQN, we also consider the following ablations, which together fully account for the differences between our approach and the baseline (ECO-DQN ≡ S2V-DQN + RevAct + ObsTun + IntRew). Reversible Actions (RevAct): whether the agent can flip a vertex more than once, rather than irreversibly adding vertices to the solution set; S2V-DQN uses irreversible actions and undiscounted rewards (γ=1). Observation Tuning (ObsTun): observations (2-7) from the list above that allow the agent to exploit having reversible actions, together with the reward structure described earlier. Intermediate Rewards (IntRew): whether the agent is provided with the small intermediate rewards for reaching new locally optimal solutions. Part of the performance gap is indeed attributable to allowing the agent to reverse its previous decisions ("flipping" vertices out of the solution set); however, revisiting previously flipped vertices does not automatically improve performance, as the agent must also be able to make informed choices about when to do so, which is the role of the additional observations. The intermediate rewards further shape the exploratory behaviour at test time.

We next consider the generalisation of agents to unseen graph sizes and structures, again using 50 randomly initialised episodes per graph. We train agents on ER graphs with |V|=40 and then test them on BA graphs of up to |V|=500, and vice versa, with the results summarised in table 2 alongside the probability that each agent reaches the "optimum" solution. ECO-DQN performs well across a range of graph sizes and structures, even if they were not represented during training, which is a highly desirable characteristic for practical combinatorial optimization. A minimal sketch of this multi-episode evaluation loop is given below.
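The multi-episode evaluation protocol can be summarised as follows. `run_episode` is a placeholder for an agent (or heuristic) that maps a graph and an initial solution to a final solution; the `mca` and `cut_value` helpers from the earlier sketches can be used as stand-ins so the snippet runs end to end. Names and the handling of the reference "optimum" are illustrative.

```python
import numpy as np

def best_of_k_episodes(W, run_episode, k=50, seed=0):
    """Run k randomly initialised episodes on one graph and keep the best cut."""
    rng = np.random.default_rng(seed)
    best = -np.inf
    for _ in range(k):
        x0 = rng.choice([-1.0, 1.0], size=W.shape[0])   # random initial solution
        x = run_episode(W, x0)
        best = max(best, cut_value(W, x))
    return best

def approximation_ratio(W, run_episode, optimum, k=50):
    """Best cut over k episodes, relative to the reference 'optimum' value."""
    return best_of_k_episodes(W, run_episode, k) / optimum

# Example with the greedy stand-in policy:
# ratio = approximation_ratio(W, lambda W, x0: mca(W, x0), optimum)
```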
Further analysis of the agent's behaviour is presented in figures 2b and 2c, which show the action preferences and the types of states visited, respectively, over the course of optimization episodes undertaken by the trained agent on three random graphs. Also shown is the probability, at each timestep, that the best solution that will be found within the episode has already been seen (MC found). The larger graph size is chosen for this analysis as it provides greater scope for the agent to exhibit non-trivial behaviour. The number of distinct locally optimal states seen grows monotonically over the episode, implying that the agent keeps finding ever better solutions while exploring.

Finally, we test ECO-DQN on publicly available datasets. In addition to ECO-DQN, S2V-DQN and the MCA algorithms, we use CPLEX, an industry-standard integer programming solver, and a pair of recently developed simulated annealing heuristics by Tiunov et al. [tiunov19] and Leleu et al. To apply CPLEX, the Max-Cut problem is first transformed into a QUBO (Quadratic Unconstrained Binary Optimization) problem, which is then solved using mixed integer programming by the CPLEX branch-and-bound routine; a sketch of this transformation is given at the end of this section. SimCIM, the heuristic of Tiunov et al. [tiunov19], models the classical dynamics within a coherent Ising machine (CIM) [yamamoto17]: the binary vertex labels are relaxed to analog values, each initialised to 0 and then subjected to evolution according to a set of stochastic differential equations that describe the operation of the CIM, until the system eventually settles with all vertices in near-binary states. The hyperparameters of SimCIM were optimised using a differential evolution approach with M-LOOP [wigley16]. A key feature of the heuristic of Leleu et al. is the modification of the time-dependent interaction strengths in such a way as to destabilise locally optimal solutions; details of both methods beyond the high-level description given here can be found in the referenced works. Each stochastic baseline is run 50 times per graph, each method is given a fixed time budget, and the best solution found within that budget is taken as its final answer. Among the publicly available benchmarks is the GSet, a collection of large graphs that have been well investigated [benlic13]; we separately consider the first ten graphs, G1-G10, which have |V|=800, and a further set of larger graphs, G22-G32, which have |V|=2000.

In summary, this work introduces ECO-DQN, which treats combinatorial optimization as an exercise in surpassing the best observed solution and produces state-of-the-art RL performance on the Maximum Cut problem while generalising well to unseen graph sizes and structures. Moreover, because ECO-DQN can start from any arbitrary configuration, it opens the door to combining learned exploration with other search heuristics.
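The text states that the Max-Cut instances are first transformed into QUBO form before being passed to CPLEX's branch-and-bound routine. The sketch below builds such a QUBO matrix; it does not reproduce the CPLEX API, and the function name is illustrative.

```python
import numpy as np

def maxcut_to_qubo(W):
    """Build a QUBO matrix Q such that, for binary x in {0,1}^n,
    x^T Q x equals the cut value of the partition defined by x.

    An edge (i, j) is cut iff x_i != x_j, and for binary variables
    x_i + x_j - 2*x_i*x_j = 1 exactly when x_i != x_j.
    """
    Q = -W.astype(float)                     # off-diagonal: -w_ij (counted twice in x^T Q x)
    np.fill_diagonal(Q, W.sum(axis=1))       # diagonal: sum_j w_ij (linear terms, x_i^2 = x_i)
    return Q

# Sanity check against the +/-1 formulation: with x01 = (x + 1) / 2,
# x01 @ maxcut_to_qubo(W) @ x01 == cut_value(W, x).
```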