Bi-Level Contextual Bandits: Fair Resource Allocation When Feedback Is Delayed

How do you allocate scarce resources to people over time when outcomes arrive late, populations change, and you still care about fairness? A bi-level contextual bandit architecture offers a surprisingly practical answer.
Introduction / Core Idea
Imagine you're running a tutoring or job training program with a fixed budget. Each round you must decide who gets help and what kind of help, but:
- The effect of an intervention shows up weeks or months later.
- People arrive and leave in cohorts (semesters, program batches).
- You must respect fairness across demographic groups.
- Real-world rules like cooldown periods ("don't send the same resource twice in a row to the same person") apply.
The paper this post is based on introduces a two-level contextual bandit system that tackles all of that at once. At a high level:
- The meta level decides how much budget each subgroup (e.g., "Black students", "Hispanic students") gets for each resource type.
- The base level then chooses the best individuals inside each subgroup, using a contextual bandit driven by a learned outcome model.
Crucially, the method explicitly models delayed feedback via resource-specific delay kernels and enforces cooldown and budget constraints, so policies are not just statistically good but also deployable.
How It Works
The system starts from a standard contextual bandit view: each action is an (individual, resource) pair, with a context vector holding demographics and history; a learned predictor maps context to expected outcome (e.g., GPA, employment probability).
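To make that concrete, here is a minimal sketch of the base view, assuming a ridge-regression outcome model and illustrative feature names and sizes; none of this is the paper's actual code.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Minimal sketch: each action is an (individual, resource) pair.
# The context concatenates individual features with a one-hot resource code;
# a regression model predicts the expected outcome (e.g., GPA change).
N_RESOURCES = 2

def make_context(person_features: np.ndarray, resource_id: int) -> np.ndarray:
    resource_onehot = np.eye(N_RESOURCES)[resource_id]
    return np.concatenate([person_features, resource_onehot])

# Fit the shared outcome model on logged (context, observed outcome) pairs.
rng = np.random.default_rng(0)
X_logged = rng.normal(size=(200, 5 + N_RESOURCES))   # fake logged contexts
y_logged = rng.normal(size=200)                       # fake observed outcomes
outcome_model = Ridge(alpha=1.0).fit(X_logged, y_logged)

# Predicted outcome for one candidate (individual, resource) action.
person = rng.normal(size=5)
predicted = outcome_model.predict(make_context(person, resource_id=1)[None, :])[0]
```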
Three real-world wrinkles are added:
- Delayed rewards with kernels: each resource has a delay kernel that spreads its impact over future rounds (built from discretized Beta distributions). That lets the algorithm learn under realistic "outcome arrives slowly" dynamics instead of assuming instant feedback (a minimal sketch of a kernel appears after this list).
- Cooldown constraints: after someone receives a resource, a cooldown window makes them temporarily ineligible for that same resource, preventing unrealistic repeated allocations and reflecting operational or ethical limits.
- Dynamic cohorts: the horizon is split into blocks (semesters, months). Each cohort is only eligible in its block, so the policy must adapt as the population shifts over time.
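As referenced above, here is one way the delay-kernel and cooldown pieces could look in code. This is a sketch under stated assumptions: the Beta parameters, horizon length, and function names are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import beta

def delay_kernel(a: float, b: float, horizon: int) -> np.ndarray:
    """Discretize a Beta(a, b) density over `horizon` future rounds.

    The kernel gives the fraction of a resource's total effect that lands
    in each of the next `horizon` rounds; it sums to 1.
    """
    edges = np.linspace(0.0, 1.0, horizon + 1)
    mass = beta.cdf(edges[1:], a, b) - beta.cdf(edges[:-1], a, b)
    return mass / mass.sum()

# Illustrative shapes (not the paper's exact parameters):
k_tutoring  = delay_kernel(a=4.0, b=2.0, horizon=5)   # slow, late-peaking effect
k_mentoring = delay_kernel(a=1.0, b=4.0, horizon=5)   # fast, early-peaking effect

def spread_reward(total_effect, kernel, t, reward_buffer):
    """Credit an allocation made at round t across rounds t..t+len(kernel)-1."""
    end = min(t + len(kernel), len(reward_buffer))
    reward_buffer[t:end] += total_effect * kernel[: end - t]

def eligible(last_given_round, t, cooldown):
    """Cooldown check: eligible only if enough rounds passed since the last allocation."""
    return last_given_round is None or t - last_given_round >= cooldown
```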
On top of this, the algorithm introduces a bi-level structure:
- Meta level (group budgets): works on the simplex of "how much budget each group–resource pair gets". It evaluates candidate allocations by simulating rollouts with the shared outcome model as a surrogate and applies a UCB-style acquisition rule over this meta policy space.
- Base level (individual selection): given the group budgets, it runs a contextual UCB within each subgroup, picking the individuals with the highest UCB scores while respecting cooldown and budget (sketched in code after this list).
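Here is a hedged sketch of the base-level step, given budgets the meta level has already chosen. The function names, the exploration weight `alpha`, and the toy data are assumptions for illustration; the meta level itself (scoring candidate budget vectors via simulated rollouts) is omitted for brevity.

```python
import numpy as np

def ucb_score(mean: np.ndarray, std: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    # Optimism in the face of uncertainty: predicted outcome plus an exploration bonus.
    return mean + alpha * std

def base_level_select(groups, budgets, mean, std, eligible_mask, alpha=1.0):
    """Pick individuals inside each subgroup under that group's budget.

    groups:        group id per candidate (individual, resource) action
    budgets:       dict {group_id: number of allocations granted by the meta level}
    mean, std:     outcome-model prediction and uncertainty per candidate
    eligible_mask: False where cooldown (or cohort membership) rules the action out
    """
    scores = ucb_score(mean, std, alpha)
    scores = np.where(eligible_mask, scores, -np.inf)   # drop ineligible actions
    chosen = []
    for g, k in budgets.items():
        idx = np.flatnonzero(groups == g)
        top = idx[np.argsort(scores[idx])[::-1][:k]]     # best k actions in group g
        chosen.extend(int(i) for i in top if np.isfinite(scores[i]))
    return chosen

# Toy usage: 6 candidate actions, two groups, meta level grants 2 and 1 slots.
rng = np.random.default_rng(1)
groups = np.array([0, 0, 0, 1, 1, 1])
chosen = base_level_select(
    groups,
    budgets={0: 2, 1: 1},
    mean=rng.normal(size=6),
    std=rng.uniform(0.1, 0.5, size=6),
    eligible_mask=np.array([True, True, False, True, True, True]),
)
```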
Empirically, on education and workforce datasets, this bi-level setup improves cumulative reward and reduces disparity between groups compared to single-level bandit baselines.
Examples
Below are conceptual prompts you could use to guide an LLM-powered planner that implements ideas from this framework. They're simplified but capture the flavor.
Designing a Bi-Level Allocation Policy
You are an AI policy planner.
We have 3 demographic groups (A, B, C) and 2 resources:
- R1: intensive tutoring (expensive, strong delayed effect)
- R2: light mentoring (cheaper, moderate delayed effect)
Constraints:
- Total budget: 100 allocations per semester
- Each person can receive the same resource at most once every 3 rounds (cooldown).
- We want to reduce disparity in average outcome between groups.
Step 1 (Meta level): Propose a subgroup-level budget split over (group, resource) that balances expected utility and fairness, assuming group B is historically underserved.
Step 2 (Base level): Describe how you would select individuals inside each group using a contextual UCB policy and a learned outcome model.
Return:
- A table with the proposed (group, resource, budget share).
- A short explanation of how delayed effects and cooldowns would be handled.
Expected Output (sketch)
- A table like:
- A: R1=25, R2=15
- B: R1=30, R2=20
- C: R1=5, R2=5
- Explanation that:
- Group B gets extra budget to close historical gaps.
- Delayed rewards are modeled via resource-specific kernels when evaluating candidate policies.
- The base level UCB uses prediction mean + uncertainty while respecting cooldown filters on the candidate set.
Simulating Delayed Feedback
You simulate a delayed feedback bandit environment for student tutoring.
- R1 ("intensive coaching") has a slow but strong effect peaking after 4 rounds.
- R2 ("quick check in") has a fast but weaker effect peaking after 1 round.
Define discrete delay kernels over 5 rounds for R1 and R2, and explain how they change the regret signal for an allocation algorithm that assumes immediate feedback vs. one that uses the correct kernels.
Expected Output (sketch)
- Two simple 5 step kernels like:
- R1: [0.0, 0.1, 0.3, 0.4, 0.2]
- R2: [0.5, 0.3, 0.1, 0.1, 0.0]
- Explanation that:
- An "immediate" algorithm will misattribute reward timing, over penalize R1 early, and under explore it.
- A kernel aware algorithm spreads reward across future rounds, so regret estimates reflect the true temporal profile.
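A tiny simulation using the two kernels above, with made-up effect sizes, shows why this matters: a learner that only credits the round-0 reward ranks the resources in the wrong order.

```python
import numpy as np

# The same allocation generates the same total effect, but the reward
# arrives on different schedules depending on the resource's kernel.
k_r1 = np.array([0.0, 0.1, 0.3, 0.4, 0.2])   # intensive coaching: slow, peaks at round 4
k_r2 = np.array([0.5, 0.3, 0.1, 0.1, 0.0])   # quick check-in: fast, peaks at round 1

def observed_rewards(total_effect: float, kernel: np.ndarray) -> np.ndarray:
    """Reward actually observed in each of the 5 rounds after allocation."""
    return total_effect * kernel

r1_stream = observed_rewards(1.0, k_r1)   # strong but delayed
r2_stream = observed_rewards(0.6, k_r2)   # weaker but immediate

# An "immediate feedback" learner only credits the round-0 reward to the action:
naive_r1, naive_r2 = r1_stream[0], r2_stream[0]     # 0.0 vs 0.3 -> wrongly prefers R2

# A kernel-aware learner attributes the whole stream back to the action:
true_r1, true_r2 = r1_stream.sum(), r2_stream.sum()  # 1.0 vs 0.6 -> correctly prefers R1

print(naive_r1, naive_r2, true_r1, true_r2)
```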
Fairness vs. Utility Trade-off
You are given logs from a contextual bandit that treated "all individuals equally" without subgroup budgets.
Analysis shows that one subgroup received 60% of all allocations while another received only 10%, despite similar predicted benefits.
Explain how introducing a meta-level allocator with subgroup budgets can reduce disparity while still using a base-level contextual UCB inside each subgroup.
Expected Output (sketch)
- The meta level constrains how much any subgroup can be over- or under-served.
- Within each subgroup, the base-level bandit still targets the best individuals, preserving personalization.
- Under reasonable assumptions, disparity in group-level outcomes is provably reduced compared to a flat bandit (the toy sketch below illustrates the mechanism).
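A toy numeric sketch of that argument, with made-up scores and an even per-group budget, shows how meta-level caps change allocation shares without touching within-group selection.

```python
import numpy as np

# Flat bandit vs. the same scores re-allocated under meta-level group budgets.
# All numbers here are invented purely for illustration.
groups = np.array(["A"] * 6 + ["B"] * 6 + ["C"] * 6)
scores = np.array([
    0.90, 0.85, 0.80, 0.75, 0.70, 0.65,   # group A: slightly higher scores
    0.88, 0.79, 0.69, 0.60, 0.50, 0.40,   # group B
    0.87, 0.78, 0.61, 0.51, 0.41, 0.31,   # group C
])
K = 9  # total budget

# Flat policy: take the top-K scores regardless of group.
flat = np.argsort(scores)[::-1][:K]
flat_shares = {g: int(np.sum(groups[flat] == g)) for g in "ABC"}

# Meta-level policy: fix per-group budgets first, then take the best within each group.
budgets = {"A": 3, "B": 3, "C": 3}
capped = []
for g, k in budgets.items():
    idx = np.flatnonzero(groups == g)
    capped.extend(idx[np.argsort(scores[idx])[::-1][:k]])
capped_shares = {g: int(np.sum(groups[np.array(capped)] == g)) for g in "ABC"}

print(flat_shares, capped_shares)   # {'A': 5, 'B': 2, 'C': 2} vs {'A': 3, 'B': 3, 'C': 3}
```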
Insights / Practical Takeaways
For practitioners, the main message is that you don't have to choose between realism, fairness, and tractability:
- You can model delays explicitly with resource-specific kernels instead of pretending everything is immediate.
- You can encode fairness as group-level budget constraints at the meta level, while keeping efficient contextual bandits at the base level.
- Cooldowns and cohort blocks are not afterthoughts; they can be first-class citizens in the optimization problem, which matters if you actually want to deploy these systems in education, healthcare, or workforce settings.
Conclusion
Bi-level contextual bandits turn a messy real-world problem, allocating limited interventions over time under delay and fairness constraints, into a structured, learnable pipeline. You get a global controller that reasons about who should get how much, and a local controller that decides who gets it next, all while respecting delayed impact and operational rules.
If you're building policy optimization or recommendation systems in sensitive domains, this architecture is a strong blueprint: align constraints and fairness at the top, exploit fine-grained heterogeneity at the bottom, and treat delayed feedback as a core modeling choice rather than noise.


