Creating a multi-armed bandit
A multi-armed bandit problem is a classic example used in reinforcement learning to describe a scenario where an agent must choose between multiple actions (or "arms") without knowing the expected reward of each. Over time, the agent learns which arm yields the highest reward by exploring each option. This exercise involves setting up the foundational structure for simulating a multi-armed bandit problem.
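To make the scenario concrete, here is a minimal sketch of a single "pull": each arm pays out 1 with its own fixed (but unknown to the agent) probability, and 0 otherwise. The helper name pull_arm and the rng argument are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def pull_arm(true_bandit_probs, arm, rng=None):
    # Hypothetical helper: sample a Bernoulli reward for one arm.
    # Returns 1 with probability true_bandit_probs[arm], else 0.
    rng = rng or np.random.default_rng()
    return int(rng.random() < true_bandit_probs[arm])
```

The agent only ever sees these 0/1 rewards, never the underlying probabilities, which is why it must balance exploring all arms with exploiting the one that currently looks best.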
The numpy library has been imported as np.
This exercise is part of the course
Reinforcement Learning with Gymnasium in Python
Exercise instructions
- Generate an array true_bandit_probs with random probabilities representing the true underlying success rate for each bandit.
- Initialize two arrays, counts and values, with zeros; counts tracks the number of times each bandit has been chosen, and values represents the estimated winning probability of each bandit.
- Create rewards and selected_arms arrays to store the rewards obtained and the arms selected in each iteration.
Interactive exercise
Complete the sample code to successfully finish this exercise.
def create_multi_armed_bandit(n_bandits):
  # Generate the true bandit probabilities
true_bandit_probs = ____
# Create arrays that store the count and value for each bandit
counts = ____
values = ____
# Create arrays that store the rewards and selected arms each episode
rewards = ____
selected_arms = ____
return true_bandit_probs, counts, values, rewards, selected_arms
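One possible way to fill in the blanks is sketched below. The n_episodes parameter is an assumption made so the sketch is self-contained; in the exercise the number of iterations may be a predefined variable instead.

```python
import numpy as np

def create_multi_armed_bandit(n_bandits, n_episodes=1000):
    # True underlying success probability for each bandit
    true_bandit_probs = np.random.rand(n_bandits)
    # counts[i]: number of times bandit i has been chosen
    counts = np.zeros(n_bandits)
    # values[i]: estimated winning probability of bandit i
    values = np.zeros(n_bandits)
    # Reward obtained and arm selected at each iteration
    rewards = np.zeros(n_episodes)
    selected_arms = np.zeros(n_episodes, dtype=int)
    return true_bandit_probs, counts, values, rewards, selected_arms
```

Using np.zeros for counts and values encodes the starting assumption that nothing is known yet: every estimate begins at zero and is updated incrementally as arms are pulled.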