- To keep things simple, assume the sets used in the following example are finite, and unless stated otherwise, all probability spaces use the power set as the sigma-algebra.
- We have a set of messages $\Omega$ and a set of codes $C=\{c_1, \dots, c_n\}$. A sender sends a message $\omega\in\Omega$ by sending one of the codes. By decoding any code $c_i$, in general, the receiver can only determine with certainty that an element of a particular subset $A_i \subseteq \Omega$ was sent as a message. Note that the sender is truthful, meaning that they know the decoding process, and if they want to convey a message $\omega$, they will only send codes from $\{c_i~|~ \omega \in A_i\}$.
- We also have a frequency-based probability distribution $p$ over the codes that tells us the probability of each code being picked, i.e., $p(c_i)$.
- Note that this distribution is inherently compatible with an (uncountably infinite) set of probability distributions over the message space $\Omega$. For example, assume $\Omega=\{\omega_1, \dots, \omega_4\}$, $C=\{c_1, c_2\}$, $A_1=\{\omega_1, \omega_2\}$, and $A_2=\{\omega_3, \omega_4\}$. If $p(c_1)=0.5=p(c_2)$, then both $P(\omega)=(0.25, 0.25, 0.25, 0.25)$ and $P(\omega)=(0.2, 0.3, 0.4, 0.1)$ are possible probability distributions over $\Omega$. In fact, there are uncountably many probability distributions compatible with the given $p$. Moreover, we also don't know the $P(c\,|\,\omega)$ that the sender uses.
- Since we talk about the frequentist probability of $c_i$, let's take a look at an example trial where we sampled some $c=c_i$. Then for any $A\subseteq \Omega$, by looking at $A_i$, we can either
	1. prove that $A$ <span style="color:green">occurs</span> $\Leftrightarrow A_i \subseteq A$,
	2. prove that $A$ <span style="color:red">does *not* occur</span>, i.e., prove that $A^c$ occurs $\Leftrightarrow$ $A_i\subseteq A^c$, or
	3. <span style="color:grey">prove neither</span> of the above two $\Leftrightarrow$ $A_i\cap A\not=\emptyset$ and $A_i\cap A^c\not=\emptyset$.
- Together these three exhaust the space of statements we can make, so for any trial we should be able to make exactly one of these statements for any $A\subseteq \Omega$. ![[Dempsters rule space of statements.excalidraw.dark.png|400]]
- The presence of a probability distribution/frequencies $p$ gives meaning to belief $Bel$ and plausibility $Pl$. Specifically, for any $A\subseteq \Omega$, $Bel(A)=\sum_{A_i\subseteq A} p(c_i), ~~~~~Pl(A)=\sum_{A_i \cap A\not=\emptyset} p(c_i)$, meaning that the belief of a subset $A$ is the total chance, or total proportion of times (in the frequentist sense, if we were to sample codes), that we will be able to prove with certainty that the actual message lies in $A$, i.e., that $A$ occurs. The plausibility, on the other hand, is the total proportion of times we will *not* be able to prove that the message does *not* lie in $A$ (thus making it *plausible* that it lies in $A$). To prove that the message does not lie in $A$ we need to prove that it lies in $A^c$, and we cannot prove that the message lies in $A^c$ if $A_i$ is not a subset of $A^c$. So $Pl(A)=p(A_i\not\subseteq A^c) \,=p(A_i\cap A\not=\emptyset) = 1-p(A_i\subseteq A^c) = 1-Bel(A^c)$.
- Now assume that there is a back-up sender who has another set of codes $C'=\{c_i'\}$. The back-up sender also knows the message that needs to be sent and sends a compatible code from $C'$ accordingly. Again, we don't know $P(c'\,|\,\omega)$.
- **The "as independent as they can be" assumption behind Dempster's rule!**: At the risk of sounding a little vague, I want to say that we want to assume that the random variables $c$ and $c'$ are as independent as possible. Notice that the random variables $c$ and $c'$ cannot be completely independent because their values depend on the message $\omega$ that needs to be sent.
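The three-way classification and the definitions of $Bel$ and $Pl$ can be sketched in a few lines of Python, using the earlier toy example ($\Omega=\{\omega_1,\dots,\omega_4\}$, two codes with $p(c_1)=p(c_2)=0.5$); all names here are illustrative:

```python
Omega = frozenset({"w1", "w2", "w3", "w4"})

# code -> (decoded subset A_i, frequency p(c_i)); toy values from the example above
codes = {
    "c1": (frozenset({"w1", "w2"}), 0.5),
    "c2": (frozenset({"w3", "w4"}), 0.5),
}

def classify(A_i, A):
    """What a single trial with decoded set A_i lets us prove about A."""
    if A_i <= A:
        return "A occurs"              # case 1: A_i is a subset of A
    if A_i <= Omega - A:
        return "A does not occur"      # case 2: A_i is a subset of A^c
    return "cannot prove either"       # case 3: A_i overlaps both A and A^c

def bel(A):
    # total frequency of trials that *prove* A occurs
    return sum(p for A_i, p in codes.values() if A_i <= A)

def pl(A):
    # total frequency of trials that *fail to rule out* A
    return sum(p for A_i, p in codes.values() if A_i & A)

A = frozenset({"w1", "w2", "w3"})
assert abs(pl(A) - (1 - bel(Omega - A))) < 1e-12  # Pl(A) = 1 - Bel(A^c)
print(bel(A), pl(A))  # 0.5 1.0
```

Note how $Bel(A)=0.5$ here: only $c_1$ proves $A$, while $c_2$ leaves it undecided, which is exactly why $Pl(A)=1$.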
This imposes a constraint on $P(c\,|\,c')$ and $P(c'\,|\,c)$. Concretely, seeing a value $c_j'$ excludes $\mathcal S_j= \{c_i\in C\,|\,A_i \subseteq (A_j')^c\}$, meaning $P(\mathcal S_j \,|\,c_j')=0$, and vice-versa for $c_i$. What we are assuming is that the senders are not aware of each other, and beyond this constraint imposed by the true $\omega$, there is no other information about $c'$ in $c$ and vice-versa.
- An alternate way to think about this assumption is the following. Say both senders sample their codes using whatever distributions they like, independently of each other and even independently of the true message $\omega$. The true message $\omega$ is then sampled independently. The senders are only allowed to send their codes if both sampled codes are compatible with the message; otherwise the entire sample $(c, c', \omega)$ is discarded.
- 💡 Take a moment to think about what this means for the robustness of a model w.r.t. irrelevant perturbations of the input feature distribution. If the codes were not independent, we would need to keep them in a product space and operate using a probability distribution on that product space. Also, due to the underlying assumption that both codes contain the true message the sender wanted to send, the decoding algorithm by design decodes jointly by taking the intersection of the sets represented by both codes. So the probability mass of a pair of codes $(c_i, c_j')$ goes to $A_i \cap A_j'$. This is still very much like Dempster's rule, but the main advantage of Dempster's rule, i.e., independent processes, is gone. We would need joint data to determine the joint probability distribution over both codes.
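The "sample independently, discard incompatible pairs" view above can be sketched directly: each pair $(c_i, c_j')$ sends mass $p(c_i)\,p(c_j')$ to $A_i \cap A_j'$, pairs with empty intersection are discarded, and the surviving mass is renormalized. A minimal sketch (the function name and the toy masses are mine, not from any library):

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts: frozenset -> mass) by Dempster's rule."""
    combined = {}
    for (A, p), (B, q) in product(m1.items(), m2.items()):
        inter = A & B
        if inter:  # compatible pair: its mass goes to A ∩ B
            combined[inter] = combined.get(inter, 0.0) + p * q
    Z = sum(combined.values())  # 1 - (mass of the discarded, conflicting pairs)
    return {S: m / Z for S, m in combined.items()}

# toy masses induced by two independent senders
m1 = {frozenset({"w1", "w2"}): 0.5, frozenset({"w3", "w4"}): 0.5}
m2 = {frozenset({"w2", "w3"}): 1.0}
print(dempster_combine(m1, m2))  # mass splits evenly between {w2} and {w3}
```

The renormalization by $Z$ is exactly the conditioning on "the sample was not discarded" in the rejection-sampling picture above.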
## Concrete example

### Message Space $\Omega$

- Latte
- Cappuccino
- Green Tea
- Mocha

### Codes $C$

- $c_1$: Hot drink ($A_1 = \Omega$)
- $c_2$: Not coffee-based ($A_2 = \{\text{Green Tea}\}$)
- $c_3$: Contains milk ($A_3 = \{\text{Latte, Cappuccino, Mocha}\}$)

### Back-up Sender's Codes $C'$

- $c_1'$: Contains caffeine ($A_1' = \{\text{Latte, Cappuccino, Mocha}\}$)
- $c_2'$: Served in a large cup ($A_2' = \{\text{Latte, Green Tea}\}$)
- $c_3'$: Dark-colored drink ($A_3' = \{\text{Mocha}\}$)

### Scenario

1. The primary sender wants "Latte" and sends $c_3$.
2. The back-up sender sends $c_2'$.
3. The receiver decodes $c_3$ to $A_3$ and $c_2'$ to $A_2'$.
4. The receiver deduces "Latte", as it is the only element common to both $A_3$ and $A_2'$.
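The scenario above is the intersection-based joint decoding in miniature: the receiver simply intersects the two decoded sets. A toy sketch (variable names are mine):

```python
A3 = {"Latte", "Cappuccino", "Mocha"}  # c3: contains milk
A2_prime = {"Latte", "Green Tea"}      # c2': served in a large cup

decoded = A3 & A2_prime  # joint decoding = set intersection
print(decoded)  # {'Latte'}
```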