@justinabsentia
Forked from ruvnet/Liar-Ai.md
Created February 9, 2025 05:22

Revisions

  1. @ruvnet ruvnet revised this gist Feb 8, 2025. 1 changed file with 624 additions and 808 deletions.
    1,432 changes: 624 additions & 808 deletions notebook.ipynb
    @@ -1,811 +1,627 @@
    {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {
    "colab": {
    "name": "MultiModal_LieDetection_ReAct_Tutorial.ipynb"
    },
    "kernelspec": {
    "display_name": "Python 3",
    "name": "python3"
    }
    "cells": [
    {
    "cell_type": "markdown",
    "id": "9a321869",
    "metadata": {},
    "source": [
    "# Multi-Modal Lie Detection with GSPO-enhanced ReAct Reasoning\n",
    "\n",
    "This notebook demonstrates a multi-modal deception detection system that integrates multiple data sources (video, audio, text, and more) with an advanced reasoning framework. The system uses **GSPO-enhanced ReAct** reasoning, combining self-play reinforcement learning and a reasoning-action loop for improved decision-making. It emphasizes transparency, explainability, and ethical considerations in AI-driven lie detection."
    ]
    },
    "cells": [
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "# Multi-Modal Lie Detection with ReAct: A Step-by-Step Tutorial\n",
    "In this tutorial, we implement a multi-modal lie detection system that analyzes **vision**, **audio**, **text**, and optionally **physiological** signals. By using an agent-based approach called **ReAct** (Reasoning + Acting), the system can reason about its outputs and involve humans in the loop for validation. We will cover everything from installing requirements to evaluating performance, while emphasizing privacy and ethical use.\n",
    "\n",
    "**Overview**:\n",
    "- *Installation & Setup*: Prepare the environment (Google Colab and Drive integration).\n",
    "- *Project Overview*: Understand multi-modal deception detection and the ReAct reasoning framework.\n",
    "- *Model Implementations*: Build models for facial cues, vocal stress, and text analysis, optionally including physiological data, and combine them.\n",
    "- *Interactive Features*: Use widgets and user input to incorporate human feedback and explain model decisions.\n",
    "- *Inference & Real-Time Processing*: Run the lie detector on sample inputs (video, audio, text) and simulate real-time usage.\n",
    "- *Testing & Evaluation*: Verify model components with tests, and evaluate accuracy, precision/recall, AUC, etc.\n",
    "- *Ethical Considerations*: Address bias, privacy, legal compliance (e.g., GDPR, EU AI Act), and responsible deployment practices.\n",
    "\n",
    "Let's get started!"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 1. Installation & Setup\n",
    "\n",
    "First, we need to install the required libraries and set up our environment. This notebook is designed for **Google Colab** for ease of use. It will also demonstrate how to integrate with **Google Drive** if you want to save or load data (like videos or models).\n",
    "\n",
    "**Dependencies**:\n",
    "- `torch` (PyTorch) for building deep learning models.\n",
    "- `transformers` (HuggingFace) for NLP models.\n",
    "- `opencv-python` for image and video processing.\n",
    "- `librosa` for audio processing.\n",
    "- `shap` and `lime` for explainability (optional).\n",
    "- `ipywidgets` for interactive widgets.\n",
    "- `scikit-learn` for evaluation metrics (optional).\n",
    "\n",
    "We'll also ensure we have access to GPU (if available) for faster computations and mount Google Drive for data storage."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "!pip install transformers opencv-python librosa shap lime scikit-learn ipywidgets"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "*Note:* If using Google Drive to store or retrieve data, you can mount it here:"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "from google.colab import drive\n",
    "drive.mount('/content/drive')"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "import torch\n",
    "print(\"Torch version:\", torch.__version__)\n",
    "print(\"GPU available:\", torch.cuda.is_available())"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 2. Project Overview\n"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Multi-Modal Deception Detection\n",
    "Combining multiple data modalities can improve the accuracy of lie detection by capturing different cues. Traditional lie detection often relies on a single source like physiological signals (e.g., polygraph measurements), which is not very reliable. In a multi-modal system, we analyze **facial expressions**, **voice tone**, **spoken or written text**, and even **physiological sensors** together. Each modality may provide unique indicators of stress or deceit:\n",
    "- **Vision**: Micro-expressions, eye movements, and body language (e.g., fidgeting) could suggest discomfort associated with lying.\n",
    "- **Audio**: Changes in pitch, tone, speech rate, or hesitation in voice can be signs of stress.\n",
    "- **Text**: Linguistic cues such as choice of words, sentiment, or contradictions in a story might indicate deception.\n",
    "- **Physiological**: Heart rate, skin conductance (sweating), etc., can reflect nervousness.\n",
    "\n",
    "By fusing these signals, the system reduces uncertainty from any single source and makes a more informed judgment. Research has shown that integrating verbal and nonverbal cues improves detection performance compared to unimodal approaches."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### ReAct Reasoning and Agentic Decisions\n",
    "Instead of a black-box classifier, our system uses an **agent** that can reason about the inputs and its own outputs. We adopt the **ReAct (Reasoning + Acting)** framework, where the AI agent alternates between reasoning steps and actions. In practice, this means the model will:\n",
    "1. **Reason**: Internally analyze the evidence (e.g., *\"Facial cues suggest stress, but vocal analysis is moderate\"*).\n",
    "2. **Act**: Take an action based on that analysis (e.g., *decide to gather more information* or *flag for human review*).\n",
    "3. Repeat this reasoning-action loop, refining the decision with each step.\n",
    "\n",
    "This agentic approach allows the system to not only output a prediction (truth or lie) but also an explanation of how it arrived there. The agent can use **recursive decision-making** – revisiting its conclusions if new evidence or actions suggest something different – and even use simple **reinforcement learning** techniques to improve over time. For example, the agent could learn from mistakes (with human feedback) and adjust its strategy in future interactions."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Human-in-the-Loop and Privacy\n",
    "To make the system reliable and responsible, we include a **human-in-the-loop** at critical points. This means a human (e.g., an investigator or analyst) can:\n",
    "- Review cases where the AI is uncertain or the modalities disagree.\n",
    "- Override the AI's decision if it seems incorrect.\n",
    "- Provide feedback that the AI can use to improve (a form of learning from human corrections).\n",
    "\n",
    "For instance, if facial and audio cues conflict strongly, the system can automatically flag the interview for human review instead of making a hard judgment. We will see later how the notebook can prompt for human input in such cases.\n",
    "\n",
    "**Privacy Considerations**: Because this system deals with sensitive biometric data (faces, voice recordings, heart rates, etc.), it is designed with privacy in mind. Data can be processed **on-device** or in a secure environment to avoid sending personal data to external servers. Techniques like data anonymization and encryption should be applied where possible. Under regulations like the **GDPR**, biometric data used to identify a person is a special category of personal data and requires robust protection. Therefore, any real deployment must ensure user consent is obtained and that data storage complies with privacy laws. In our demo, all data stays local to your Colab session or Google Drive to respect privacy."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 3. Model Implementations\n",
    "\n",
    "Now, we will implement the models for each modality and then create a fusion mechanism and the ReAct-based agent. For simplicity, we'll use relatively simple models and simulated data (since training a full model here is beyond scope). The focus is on the architecture and how these components interact, rather than achieving state-of-the-art accuracy.\n",
    "\n",
    "We'll implement the following:\n",
    "- **Vision Model**: a CNN to analyze facial video frames.\n",
    "- **Audio Model**: an LSTM-based model to analyze speech.\n",
    "- **Text Model**: a Transformer-based or simplified model to analyze transcript text.\n",
    "- **Physiological Model** (optional): a placeholder for handling sensor data (if available).\n",
    "- **Fusion Model**: a strategy to combine outputs from all modalities.\n",
    "- **ReAct Agent**: an agent that uses the fused results and reasoning rules to decide lie/truth and produce an explanation.\n",
    "\n",
    "Let's proceed step-by-step through each component."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Vision Model (Facial Analysis)\n",
    "For the vision modality, we'll use a Convolutional Neural Network (CNN) to extract facial cues. The model could analyze facial expressions or micro-expressions from video frames. In practice, one might use a pre-trained model (like ResNet50) fine-tuned on emotion or expression datasets for subtle indicators of deceit. Here, we'll build a simple CNN from scratch for demonstration.\n",
    "\n",
    "**Approach**:\n",
    "- We assume video frames or images of the subject are available.\n",
    "- We preprocess each frame (resize, normalize) and feed it into the CNN.\n",
    "- The CNN outputs a probability distribution over two classes: \"Truth\" vs \"Lie\".\n",
    "- For example, a tense facial expression or avoidance of eye contact might push the prediction towards \"Lie\".\n",
    "\n",
    "We'll implement a small CNN with a couple of convolutional layers and a final output layer with 2 neurons (for the two classes). No training is performed here; we'll use random weights to illustrate the pipeline."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Vision Model Implementation (CNN)\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "class VisionModel(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(VisionModel, self).__init__()\n",
    "        # Simple CNN: conv layers followed by a fully connected layer\n",
    "        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)  # downsample by 2\n",
    "        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)  # downsample further\n",
    "        self.conv3 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)\n",
    "        self.fc = nn.Linear(32 * 8 * 8, 2)  # assuming input frames 64x64 -> after 3 strides of 2 => 8x8 feature map\n",
    "    def forward(self, x):\n",
    "        x = F.relu(self.conv1(x))\n",
    "        x = F.relu(self.conv2(x))\n",
    "        x = F.relu(self.conv3(x))\n",
    "        x = x.view(x.size(0), -1)\n",
    "        x = self.fc(x)\n",
    "        # Output as probabilities for [Truth, Lie]\n",
    "        return torch.softmax(x, dim=1)\n",
    "\n",
    "# Instantiate the model and test on a dummy input\n",
    "vision_model = VisionModel()\n",
    "dummy_frame = torch.randn(1, 3, 64, 64) # batch of 1, 64x64 RGB image\n",
    "dummy_out = vision_model(dummy_frame)\n",
    "print(\"Vision model output (Truth,Lie probabilities):\", dummy_out.detach().numpy())"
    ]
    },
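The `32 * 8 * 8` input size of the final linear layer follows from the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. A minimal sketch that checks this arithmetic for the three layers above, assuming square 64x64 inputs and dilation 1 (the helper name `conv_out_size` is hypothetical and not part of the notebook):

```python
def conv_out_size(n, kernel=3, stride=2, padding=1):
    """Spatial size after one Conv2d layer (dilation 1), per the standard formula."""
    return (n + 2 * padding - kernel) // stride + 1

size = 64
sizes = [size]
for _ in range(3):  # conv1, conv2, conv3 all use kernel=3, stride=2, padding=1
    size = conv_out_size(size)
    sizes.append(size)

flattened = 32 * size * size  # conv3 produces 32 output channels
print(sizes, flattened)
```

Frames of any other resolution would change the flattened size and cause a shape mismatch in `self.fc`, which is why the inference step resizes every frame to 64x64 before calling the model.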
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Audio Model (Speech Analysis)\n",
    "For the audio modality, we analyze the speaker's voice. Signs of stress or deception can manifest as changes in vocal pitch, tone, pace, or disfluencies (ums, pauses). A common approach is to extract acoustic features (e.g., MFCCs, spectrograms) and use a sequence model to capture temporal patterns.\n",
    "\n",
    "We will implement an LSTM-based model that takes extracted features from the audio waveform and outputs a probability of truth/lie. In practice, one could use a pre-trained audio model like **Wav2Vec 2.0** for richer representations, but here we'll keep it simple:\n",
    "- Use `librosa` to extract MFCC features from an audio sample.\n",
    "- Feed the sequence of MFCC vectors into an LSTM.\n",
    "- Use the final LSTM output (or hidden state) to classify lie vs truth.\n",
    "\n",
    "This model should capture things like elevated pitch or irregular pauses which might correlate with lying."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Audio Model Implementation (LSTM)\n",
    "import torch.nn as nn\n",
    "\n",
    "class AudioModel(nn.Module):\n",
    "    def __init__(self, input_dim=13, hidden_dim=32):\n",
    "        super(AudioModel, self).__init__()\n",
    "        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)\n",
    "        self.fc = nn.Linear(hidden_dim, 2)\n",
    "    def forward(self, x):\n",
    "        # x shape: (batch, seq_len, input_dim)\n",
    "        lstm_out, (h, c) = self.lstm(x)\n",
    "        # Use last hidden state\n",
    "        last_hidden = h[-1]  # shape (batch, hidden_dim)\n",
    "        out = self.fc(last_hidden)\n",
    "        return torch.softmax(out, dim=1)\n",
    "\n",
    "audio_model = AudioModel()\n",
    "# Generate a dummy audio feature sequence (e.g., 50 time steps of 13-dim MFCCs)\n",
    "dummy_audio = torch.randn(1, 50, 13)\n",
    "dummy_audio_out = audio_model(dummy_audio)\n",
    "print(\"Audio model output (Truth,Lie probabilities):\", dummy_audio_out.detach().numpy())"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Text Model (Language Analysis)\n",
    "The text modality examines what the person is saying (or writing). Linguistic patterns can reveal deception – liars might use fewer first-person pronouns, or add certain qualifying phrases, etc. Modern approaches use Transformer-based models like BERT or RoBERTa to classify text as truthful or deceptive.\n",
    "\n",
    "To keep things simple, we'll implement a placeholder text model. For demonstration, we might use a basic keyword-based heuristic or a simple logistic model. (In a real system, you would fine-tune a pretrained transformer on a deception dataset.)\n",
    "\n",
    "Our simplified text model will:\n",
    "- Take a transcript or statement as input (string).\n",
    "- Output a probability of lie/truth.\n",
    "- *(For demonstration, we'll use a trivial rule: if the statement contains negation words like \"not\", \"never\", we might lean towards \"lie\" to simulate detecting a denial. This is just a placeholder logic.)*"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Text Model Implementation (simplified)\n",
    "import re\n",
    "import numpy as np\n",
    "import torch\n",
    "\n",
    "class TextModel:\n",
    "    def __init__(self):\n",
    "        # Example keywords indicative of deception (very naive approach):\n",
    "        self.deception_keywords = {\"not\", \"never\", \"didn't\", \"cannot\"}\n",
    "    def predict(self, text):\n",
    "        \"\"\"Return a probability tensor [p_truth, p_lie] based on the presence of keywords.\"\"\"\n",
    "        # Match whole words so that e.g. \"nothing\" does not count as \"not\"\n",
    "        words = set(re.findall(r\"[a-z']+\", text.lower()))\n",
    "        # Naive rule: if any deception keyword is present, assign higher lie probability\n",
    "        lie_prob = 0.7 if words & self.deception_keywords else 0.3\n",
    "        truth_prob = 1 - lie_prob\n",
    "        probs = torch.tensor([[truth_prob, lie_prob]])\n",
    "        return probs\n",
    "\n",
    "text_model = TextModel()\n",
    "# Test the text model with example inputs\n",
    "for example in [\"I was at home all evening.\", \"I did not take the money.\"]:\n",
    "    out = text_model.predict(example)\n",
    "    print(f\"Text: '{example}' -> Output (Truth,Lie):\", out.numpy())"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### (Optional) Physiological Model\n",
    "In some scenarios, we might have physiological data such as heart rate, skin conductance (GSR), or blood pressure. These signals can indicate stress levels (as used in a traditional polygraph). Integrating such data can provide additional clues to deception.\n",
    "\n",
    "For the scope of this tutorial, we will not implement a full physiological model, but here's how it could be handled:\n",
    "- If sensor data is available (e.g., a sequence of heart rate measurements during questioning), you could use a simple threshold model or a small neural network to detect anomalies.\n",
    "- For example, a sudden spike in heart rate or GSR could be interpreted as increased stress.\n",
    "- This model would output a probability of deception similar to the others.\n",
    "\n",
    "In our code, we'll assume we don't have this modality available. If you did, you would process it and include it in the fusion step just like the others."
    ]
    },
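The threshold idea described for the physiological modality could be sketched as follows. This is only an illustration under stated assumptions: the function name `physio_predict`, the parameters `baseline_len` and `spike_z`, and the 0.3 to 0.7 probability mapping are all hypothetical choices (the mapping simply mirrors the placeholder values used by the text model), and nothing here is a validated stress detector.

```python
import numpy as np

def physio_predict(heart_rate, baseline_len=10, spike_z=3.0):
    """Hypothetical threshold model: returns [p_truth, p_lie] from a heart-rate series.

    The first `baseline_len` samples define the resting mean and spread; any later
    sample more than `spike_z` standard deviations above that mean counts as a spike.
    """
    hr = np.asarray(heart_rate, dtype=float)
    baseline, rest = hr[:baseline_len], hr[baseline_len:]
    mean = baseline.mean()
    std = baseline.std() + 1e-6  # small constant avoids division by zero
    z_scores = (rest - mean) / std
    spike_fraction = float((z_scores > spike_z).mean()) if rest.size else 0.0
    # Illustrative mapping: 0.3 with no spikes, rising linearly to 0.7 if every sample spikes.
    p_lie = 0.3 + 0.4 * spike_fraction
    return [1.0 - p_lie, p_lie]

calm = [70, 71, 69, 70, 72, 70, 71, 69, 70, 71, 70, 71, 70, 69]
stressed = calm[:10] + [95, 98, 97, 96]  # made-up spike after the baseline window
print(physio_predict(calm), physio_predict(stressed))
```

Because the output has the same `[p_truth, p_lie]` shape as the other modalities, it could be appended to the list passed to the fusion step without further changes.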
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Fusion Model (Integrating Modalities)\n",
    "After obtaining predictions from each modality, we need to combine them into a single decision. There are different fusion strategies:\n",
    "- **Early Fusion**: combine raw features from all modalities, then classify them with a single model (requires joint training).\n",
    "- **Late Fusion**: each modality gives an independent judgment (e.g., a probability of deception), and we combine those judgments (e.g., via averaging or a meta-classifier).\n",
    "- **Hybrid Fusion**: use a more complex model (like attention) to weight modalities dynamically.\n",
    "\n",
    "We will implement a simple late fusion approach: take the average of the \"lie\" probabilities from the vision, audio, and text models. This assumes each modality is equally important (which may not be true in all cases, but it's a simple starting point).\n",
    "\n",
    "The fusion model will output a combined probability for truth/lie. We can then set a threshold (e.g., 0.5) on this combined probability to make the final classification."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Fusion function to combine modality outputs\n",
    "def fuse_predictions(predictions):\n",
    "    \"\"\"\n",
    "    Combine predictions from modalities.\n",
    "    `predictions` is a list of [p_truth, p_lie] from each available modality.\n",
    "    Returns a fused [p_truth, p_lie] list.\n",
    "    \"\"\"\n",
    "    preds = np.array(predictions)\n",
    "    avg_probs = preds.mean(axis=0)\n",
    "    # Ensure it sums to 1 (should already, if each pred is probabilities)\n",
    "    avg_probs = avg_probs / avg_probs.sum()\n",
    "    return avg_probs.tolist()\n",
    "\n",
    "# Example: fuse dummy outputs from the models\n",
    "vision_dummy = dummy_out.squeeze().tolist()\n",
    "audio_dummy = dummy_audio_out.squeeze().tolist()\n",
    "text_dummy = text_model.predict(\"Just a harmless example.\").squeeze().tolist()\n",
    "fused_dummy = fuse_predictions([vision_dummy, audio_dummy, text_dummy])\n",
    "print(\"Fused output (Truth,Lie probabilities):\", fused_dummy)"
    ]
    },
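Equal averaging treats every modality as equally trustworthy, which the text notes may not hold. One possible variation is a weighted late fusion, sketched below. The function name `fuse_weighted` and the example weights are assumptions for illustration; in practice the weights would have to come from validation data, which this notebook does not have.

```python
import numpy as np

def fuse_weighted(predictions, weights):
    """Weighted late fusion of [p_truth, p_lie] pairs; weights need not sum to 1."""
    preds = np.asarray(predictions, dtype=float)  # shape (n_modalities, 2)
    w = np.asarray(weights, dtype=float)
    if preds.shape[0] != w.shape[0]:
        raise ValueError("need exactly one weight per modality")
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    # Weighted average over the modality axis, normalized by the total weight.
    fused = (w[:, None] * preds).sum(axis=0) / w.sum()
    return fused.tolist()

preds = [[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]]  # vision, audio, text (made-up values)
print(fuse_weighted(preds, [1, 1, 1]))  # equal weights reproduce the plain average
print(fuse_weighted(preds, [2, 1, 1]))  # trust vision twice as much as the others
```

With equal weights this reduces to the plain average used by `fuse_predictions`, so it can be swapped in without changing the downstream threshold logic.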
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### ReAct Agent (Reasoning and Action)\n",
    "Now we build the central **ReAct agent** that uses the outputs of all modalities and makes a final decision with reasoning. The agent will mimic a decision-making process:\n",
    "1. It looks at the inputs from each model (vision, audio, text, etc.).\n",
    "2. It generates a reasoning trace, e.g. notes if one modality strongly indicates \"lie\" while another indicates \"truth\".\n",
    "3. If there's disagreement or low confidence, it can decide to label the result as uncertain and ask for human input.\n",
    "4. Otherwise, it makes a final call (truth or lie) and provides an explanation of how it reached that conclusion.\n",
    "\n",
    "In a real implementation, the agent could incorporate business rules or even a small reinforcement learning model to optimize its questioning strategy. We can also add a **neuro-symbolic** layer: for example, a rule like *\"If text content contradicts facial emotion, increase the deception probability\"*.\n",
    "\n",
    "Our ReAct agent here will be rule-based for clarity:\n",
    "- If all modalities agree (all high lie or all low lie probability), take that as the decision.\n",
    "- If they conflict, the agent may either choose the majority or mark the result as \"Uncertain\" and suggest human review.\n",
    "- It will produce a reasoning log of its steps."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Agent Implementation using ReAct reasoning\n",
    "class LieDetectionAgent:\n",
    "    def __init__(self, lie_threshold=0.5, conflict_threshold=0.2):\n",
    "        \"\"\"\n",
    "        lie_threshold: probability above which a modality votes 'Lie'.\n",
    "        conflict_threshold: if difference between max and min lie probabilities is above this, flag conflict.\n",
    "        \"\"\"\n",
    "        self.lie_threshold = lie_threshold\n",
    "        self.conflict_threshold = conflict_threshold\n",
    "\n",
    "    def analyze(self, vision_pred, audio_pred, text_pred):\n",
    "        \"\"\"\n",
    "        Analyze the predictions from each modality.\n",
    "        vision_pred, audio_pred, text_pred are each a list or tensor [p_truth, p_lie].\n",
    "        Returns (final_decision, reasoning_trace).\n",
    "        \"\"\"\n",
    "        vision_pred = vision_pred if isinstance(vision_pred, list) else vision_pred.squeeze().tolist()\n",
    "        audio_pred = audio_pred if isinstance(audio_pred, list) else audio_pred.squeeze().tolist()\n",
    "        text_pred = text_pred if isinstance(text_pred, list) else text_pred.squeeze().tolist()\n",
    "        modality_preds = {\n",
    "            \"Vision\": vision_pred,\n",
    "            \"Audio\": audio_pred,\n",
    "            \"Text\": text_pred\n",
    "        }\n",
    "        reasoning_trace = []\n",
    "        lie_probs = {}\n",
    "        # Note each modality's lie probability\n",
    "        for mod, pred in modality_preds.items():\n",
    "            lie_prob = pred[1]\n",
    "            lie_probs[mod] = lie_prob\n",
    "            reasoning_trace.append(f\"{mod} model indicates lie probability = {lie_prob:.2f}.\")\n",
    "\n",
    "        # Check for agreement or conflict\n",
    "        max_mod = max(lie_probs, key=lie_probs.get)\n",
    "        min_mod = min(lie_probs, key=lie_probs.get)\n",
    "        max_prob = lie_probs[max_mod]\n",
    "        min_prob = lie_probs[min_mod]\n",
    "        if max_prob - min_prob > self.conflict_threshold:\n",
    "            reasoning_trace.append(f\"High disagreement detected between modalities (range {min_prob:.2f}-{max_prob:.2f}).\")\n",
    "            conflict = True\n",
    "        else:\n",
    "            conflict = False\n",
    "\n",
    "        # Determine final decision based on average\n",
    "        avg_lie_prob = sum(lie_probs.values()) / len(lie_probs)\n",
    "        if avg_lie_prob > self.lie_threshold:\n",
    "            final_decision = \"Lie\"\n",
    "        else:\n",
    "            final_decision = \"Truth\"\n",
    "\n",
    "        reasoning_trace.append(f\"Average lie probability = {avg_lie_prob:.2f}, hence system verdict = '{final_decision}'.\")\n",
    "\n",
    "        # If conflict, recommend human review\n",
    "        if conflict:\n",
    "            reasoning_trace.append(\"Modalities are inconsistent; flagging for human review.\")\n",
    "            final_decision = final_decision + \" (Uncertain, needs human verification)\"\n",
    "\n",
    "        return final_decision, reasoning_trace\n",
    "\n",
    "# Instantiate the agent\n",
    "agent = LieDetectionAgent()\n",
    "# Test agent with dummy predictions\n",
    "test_decision, test_trace = agent.analyze(vision_dummy, audio_dummy, text_dummy)\n",
    "print(\"Agent reasoning trace (demo):\")\n",
    "for line in test_trace:\n",
    "    print(\"-\", line)\n",
    "print(\"Agent decision:\", test_decision)"
    ]
    },
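The agent above decides by averaging, but the description also mentions choosing the majority when modalities conflict. A minimal sketch of that alternative is shown here; the function name `majority_vote` and the tie-handling rule are assumptions for illustration, not part of the notebook's agent.

```python
def majority_vote(lie_probs, lie_threshold=0.5):
    """Each modality votes 'Lie' if its lie probability exceeds the threshold."""
    votes = {mod: ("Lie" if p > lie_threshold else "Truth") for mod, p in lie_probs.items()}
    lie_votes = sum(v == "Lie" for v in votes.values())
    truth_votes = len(votes) - lie_votes
    if lie_votes == truth_votes:
        # Ties are possible with an even number of modalities; defer to a human.
        return "Uncertain", votes
    return ("Lie" if lie_votes > truth_votes else "Truth"), votes

# Made-up values: one very confident modality against two mildly dissenting ones.
decision, votes = majority_vote({"Vision": 0.95, "Audio": 0.40, "Text": 0.45})
print(decision, votes)
```

For these made-up inputs the majority says "Truth", whereas the averaging rule gives a mean lie probability of 0.60 and therefore "Lie". The two rules can disagree whenever a single modality is much more confident than the rest, which is one reason such cases are worth flagging for review.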
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 4. Interactive Features\n",
    "\n",
    "Interactivity is key for a human-centered lie detection system. In this section, we'll discuss:\n",
    "- Uploading and processing user data (video, audio, text input).\n",
    "- Involving a human operator to validate or correct the AI's decisions.\n",
    "- Using explainability techniques to interpret model predictions."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Uploading Video/Audio/Text Data\n",
    "To test our system, we need to provide input data. In a Colab environment, you can upload files or use files stored in Google Drive.\n",
    "\n",
    "Below are examples of how to upload a video file and an audio file in Colab (you'll be prompted to choose files). Then we also take a text input as the transcript:"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "from google.colab import files\n",
    "\n",
    "# Upload a video file (e.g., .mp4)\n",
    "print(\"Please upload a video file for analysis:\")\n",
    "video_upload = files.upload()\n",
    "if video_upload:\n",
    "    video_path = next(iter(video_upload))\n",
    "    print(f\"Uploaded video: {video_path}\")\n",
    "\n",
    "# Upload an audio file (e.g., .wav)\n",
    "print(\"Please upload an audio file for analysis:\")\n",
    "audio_upload = files.upload()\n",
    "if audio_upload:\n",
    "    audio_path = next(iter(audio_upload))\n",
    "    print(f\"Uploaded audio: {audio_path}\")\n",
    "\n",
    "# Get text input (transcript)\n",
    "transcript = input(\"Enter the transcript or statement to analyze (or leave empty if not available): \")\n",
    "if transcript == \"\":\n",
    "    transcript = \"No transcript provided.\"\n",
    "    print(\"Using default text:\", transcript)\n",
    "else:\n",
    "    print(\"Transcript received.\")"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Human-in-the-Loop Validation\n",
    "In a real deployment, a human should review the results whenever the AI system is unsure, or as a matter of routine policy. We can simulate this in the notebook. For instance, after the agent makes a prediction, we can ask the user (human) to confirm or correct it.\n",
    "\n",
    "We will integrate a step where the system's decision is presented, and the user can input whether they agree or if they want to override the decision. This could also be done with interactive widgets (like buttons or dropdowns) for a more user-friendly UI.\n",
    "\n",
    "### Explainability Tools\n",
    "To build trust, it's important to explain why the AI made a certain decision:\n",
    "- **SHAP and LIME** can highlight which features or words influenced the models' predictions.\n",
    "- **Grad-CAM** can show which regions of a video frame the CNN focused on when predicting \"lie\".\n",
    "- **Attention visualization** in transformers can show which words in the text were considered most important.\n",
    "\n",
    "For example, let's use LIME to explain the text model's decision for a sample input. We will see what words influence the model's output (remember, our text model is very simple, so this is just to demonstrate the process)."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Install and use LIME for explainability on text model\n",
    "!pip install --quiet lime\n",
    "from lime.lime_text import LimeTextExplainer\n",
    "\n",
    "# Define a predict function for our text model that LIME can call\n",
    "class_names = [\"Truth\", \"Lie\"]\n",
    "def text_model_predict(texts):\n",
    "    results = []\n",
    "    for t in texts:\n",
    "        probs = text_model.predict(t).detach().numpy()[0]\n",
    "        results.append(probs)\n",
    "    return np.vstack(results)\n",
    "\n",
    "explainer = LimeTextExplainer(class_names=class_names)\n",
    "sample_text = \"Honestly, I did not steal anything.\"\n",
    "exp = explainer.explain_instance(sample_text, text_model_predict, num_features=6)\n",
    "print(\"LIME explanation for text:\\n\")\n",
    "for feature, weight in exp.as_list():\n",
    "    print(f\"{feature}: {weight:.3f}\")"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "*Interpretation:* In the above output, LIME lists the words and their influence on the prediction. A positive weight indicates the word contributes to predicting \"Lie\", while a negative weight would support \"Truth\". We can see which keywords our simple model is relying on (for example, \"not\" should appear with a positive weight since our model keys off negation, while words outside the keyword list should have weights near zero). In a more advanced model, this helps identify important linguistic features."
    ]
    },
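The human-in-the-loop validation step described in this section could be sketched as a small function that takes the agent's decision and a reviewer's typed response. The function name `apply_human_review` and the accepted responses ("agree", "truth", "lie") are hypothetical; in the notebook the response would typically come from `input()` or a widget.

```python
def apply_human_review(ai_decision, reviewer_input):
    """Return (final_decision, overridden) given the reviewer's typed response.

    An empty response or 'agree' keeps the AI decision; 'truth' or 'lie' sets the
    final verdict. Anything else is rejected so a typo cannot silently change it.
    """
    response = reviewer_input.strip().lower()
    if response in ("", "agree"):
        return ai_decision, False
    if response in ("truth", "lie"):
        final = response.capitalize()
        # Compare against the bare verdict, ignoring any "(Uncertain, ...)" suffix.
        base_verdict = ai_decision.split(" (")[0]
        return final, final != base_verdict
    raise ValueError(f"unrecognized review response: {reviewer_input!r}")

print(apply_human_review("Lie", "agree"))
print(apply_human_review("Lie", "truth"))
```

Returning the `overridden` flag alongside the verdict makes it straightforward to log disagreements between the reviewer and the agent, which is the feedback signal the earlier sections propose learning from.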
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 5. Inference & Real-Time Processing\n",
    "\n",
    "Now that all components are ready, let's run the lie detection system on some input data. We will use the data you provided (video, audio, text) in the previous step. The pipeline is:\n",
    "1. **Vision**: Read the video file, extract a frame (or frames) and get the vision model's prediction.\n",
    "2. **Audio**: Read the audio file, extract features, get the audio model's prediction.\n",
    "3. **Text**: Take the input transcript text and get the text model's prediction.\n",
    "4. **Fusion**: Combine the predictions from all available modalities.\n",
    "5. **Agent Decision**: Let the ReAct agent analyze the combined evidence and make a final decision (with a reasoning trace).\n",
    "6. **Human Verification** (optional): Allow a human to approve or override the decision.\n",
    "7. **Real-Time Considerations**: (Discussion) how to extend this to real-time analysis.\n",
    "\n",
    "Let's go through these steps."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "import cv2\n",
    "\n",
    "vision_pred = None\n",
    "if 'video_path' in locals() and video_path:\n",
    "    cap = cv2.VideoCapture(video_path)\n",
    "    success, frame = cap.read()\n",
    "    cap.release()\n",
    "    if success:\n",
    "        # Preprocess the frame for the model\n",
    "        frame_resized = cv2.resize(frame, (64, 64))\n",
    "        frame_rgb = cv2.cvtColor(frame_resized, cv2.COLOR_BGR2RGB)\n",
    "        frame_tensor = torch.from_numpy(frame_rgb).permute(2, 0, 1).unsqueeze(0).float() / 255.0\n",
    "        with torch.no_grad():\n",
    "            vision_out = vision_model(frame_tensor)\n",
    "        vision_pred = vision_out.squeeze().tolist()\n",
    "        print(f\"Vision model prediction (Truth,Lie): {vision_pred}\")\n",
    "    else:\n",
    "        print(\"Failed to read video frame. Vision model will be skipped.\")\n",
    "else:\n",
    "    print(\"No video provided. Skipping vision analysis.\")"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "import librosa\n",
    "\n",
    "audio_pred = None\n",
    "if 'audio_path' in locals() and audio_path:\n",
    "    y, sr = librosa.load(audio_path, sr=None, mono=True, duration=10)\n",
    "    # librosa.load returns an array rather than None, so check for an empty signal\n",
    "    if y.size > 0:\n",
    "        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)\n",
    "        mfcc = mfcc.T # shape (time_steps, 13)\n",
    "        mfcc_tensor = torch.from_numpy(mfcc).unsqueeze(0).float()\n",
    "        with torch.no_grad():\n",
    "            audio_out = audio_model(mfcc_tensor)\n",
    "        audio_pred = audio_out.squeeze().tolist()\n",
    "        print(f\"Audio model prediction (Truth,Lie): {audio_pred}\")\n",
    "    else:\n",
    "        print(\"Audio is empty. Skipping audio analysis.\")\n",
    "else:\n",
    "    print(\"No audio provided. Skipping audio analysis.\")"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Text analysis and Fusion\n",
    "text_pred = None\n",
    "if 'transcript' in locals() and transcript is not None:\n",
    "    with torch.no_grad():\n",
    "        text_out = text_model.predict(transcript)\n",
    "    text_pred = text_out.squeeze().tolist()\n",
    "    print(f\"Text model prediction (Truth,Lie): {text_pred}\")\n",
    "else:\n",
    "    print(\"No text transcript provided. Skipping text analysis.\")\n",
    "\n",
    "# Combine available modality predictions\n",
    "available_preds = []\n",
    "if vision_pred is not None:\n",
    "    available_preds.append(vision_pred)\n",
    "if audio_pred is not None:\n",
    "    available_preds.append(audio_pred)\n",
    "if text_pred is not None:\n",
    "    available_preds.append(text_pred)\n",
    "\n",
    "if available_preds:\n",
    "    fused_pred = fuse_predictions(available_preds)\n",
    "    print(\"Fused prediction (Truth,Lie):\", fused_pred)\n",
    "else:\n",
    "    fused_pred = [0.5, 0.5]\n",
    "    print(\"No modalities available to fuse. Defaulting to [0.5, 0.5].\")"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Agent decision\n",
    "# A missing modality falls back to a neutral [0.5, 0.5] (same default as the fusion step)\n",
    "# rather than [1, 0], which would count as a confident 'Truth' vote.\n",
    "neutral = [0.5, 0.5]\n",
    "final_decision, reasoning_trace = agent.analyze(\n",
    "    vision_pred if vision_pred is not None else neutral,\n",
    "    audio_pred if audio_pred is not None else neutral,\n",
    "    text_pred if text_pred is not None else neutral\n",
    ")\n",
    "print(\"\\nAgent's reasoning trace:\")\n",
    "for line in reasoning_trace:\n",
    "    print(\"*\", line)\n",
    "print(\"Agent's preliminary decision:\", final_decision)\n",
    "\n",
    "# Human-in-the-loop: ask user to approve or override\n",
    "user_feedback = input(\"Do you agree with this decision? (yes/no) \")\n",
    "if user_feedback.strip().lower() in [\"no\", \"n\"]:\n",
    "    correct_label = input(\"Please enter the correct label ('Truth' or 'Lie'): \").strip()\n",
    "    print(f\"Human override: The correct label is '{correct_label}'.\")\n",
    "    final_label = correct_label\n",
    "else:\n",
    "    final_label = final_decision\n",
    "    print(\"Decision accepted by human.\")\n",
    "\n",
    "print(\"\\nFinal decision (after human verification):\", final_label)"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "**Real-Time Use**: The above pipeline processes one batch of inputs. For real-time deception detection (e.g., during a live interview), you would continuously capture data and feed it to the models in a loop. For example:\n",
    "- Use a webcam feed to get frames and run the vision model on each (or every Nth) frame.\n",
    "- Stream audio input through the audio model in chunks.\n",
    "- Continuously update the transcript (if doing real-time speech-to-text) and analyze text segments.\n",
    "\n",
    "Such a streaming implementation would require optimizing the models for speed and perhaps using asynchronous processing. However, the core steps remain similar to what we ran above. Additionally, the system should log each interaction (inputs, model outputs, reasoning) for audit and improvement."
    ]
    },
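    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "*Sketch of every-Nth-frame processing:* The frame-skipping idea above can be illustrated without any camera or model code. In the snippet below, `score_fn` stands in for a model call that returns a lie probability for one frame, and `every_n` and `window` are illustrative parameters; none of these names come from the notebook.\n",
    "\n",
    "```python\n",
    "from collections import deque\n",
    "\n",
    "def stream_lie_scores(frames, score_fn, every_n=5, window=3):\n",
    "    '''Yield (frame_index, smoothed_score) for every Nth frame of an iterable.'''\n",
    "    recent = deque(maxlen=window)  # keeps only the most recent scores\n",
    "    for i, frame in enumerate(frames):\n",
    "        if i % every_n != 0:\n",
    "            continue  # skip frames so the loop can keep up with a live feed\n",
    "        recent.append(score_fn(frame))\n",
    "        yield i, sum(recent) / len(recent)\n",
    "\n",
    "# Toy stand-in: each 'frame' is already a number in [0, 1]\n",
    "fake_frames = [0.2] * 10 + [0.8] * 10\n",
    "results = list(stream_lie_scores(fake_frames, score_fn=lambda f: f, every_n=5, window=2))\n",
    "print(results)\n",
    "```\n",
    "\n",
    "Averaging over a short window is one simple way to keep a single noisy frame from flipping the decision; a real system would tune both parameters against latency requirements."
    ]
    },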
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 6. Testing & Evaluation\n",
    "\n",
    "Building confidence in the system requires thorough testing and evaluation. We should test each component in isolation (unit tests) and the system as a whole (integration tests), and evaluate performance on collected data."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Unit Tests for Components\n",
    "We can write simple tests to ensure each model behaves as expected. For example, check that the VisionModel returns a probability tensor of the correct shape for a given image, or that the agent returns a decision string."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Unit testing each component (simple examples)\n",
    "# Test VisionModel output shape\n",
    "test_img = torch.randn(1, 3, 64, 64)\n",
    "assert vision_model(test_img).shape == (1, 2)\n",
    "print(\"VisionModel unit test passed (output shape is 1x2).\")\n",
    "\n",
    "# Test AudioModel output shape\n",
    "test_audio = torch.randn(1, 10, 13) # 10 time steps of MFCC\n",
    "assert audio_model(test_audio).shape == (1, 2)\n",
    "print(\"AudioModel unit test passed (output shape is 1x2).\")\n",
    "\n",
    "# Test TextModel output type\n",
    "test_text_out = text_model.predict(\"This is a test.\")\n",
    "assert isinstance(test_text_out, torch.Tensor) and test_text_out.shape == (1, 2)\n",
    "print(\"TextModel unit test passed (output shape is 1x2).\")\n",
    "\n",
    "# Test Agent decision output\n",
    "dec, trace = agent.analyze([1,0], [1,0], [1,0]) # all modalities saying 'Truth'\n",
    "assert dec.startswith(\"Truth\")\n",
    "print(\"Agent unit test passed (agent returns a decision string).\")"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Performance Evaluation\n",
    "With a real dataset of labeled truthful and deceptive instances, we would train the models and then evaluate metrics like accuracy, precision, recall, and AUC (area under the ROC curve).\n",
    "\n",
    "For example, if we had arrays of true labels and predicted labels for a test set, we could do:"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "import numpy as np\n",
    "from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score\n",
    "\n",
    "# Example dummy data for demonstration\n",
    "y_true = np.array([0, 0, 1, 1]) # 0=Truth, 1=Lie (ground truth)\n",
    "y_pred = np.array([0, 1, 0, 1]) # model predictions\n",
    "y_scores = np.array([0.1, 0.9, 0.4, 0.8]) # predicted probability of 'Lie' for each instance\n",
    "\n",
    "print(\"Confusion Matrix:\\n\", confusion_matrix(y_true, y_pred))\n",
    "print(\"\\nClassification Report:\\n\", classification_report(y_true, y_pred, target_names=[\"Truth\",\"Lie\"]))\n",
    "print(\"AUC (ROC): %.2f\" % roc_auc_score(y_true, y_scores))"
    ]
    },
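    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "*Sketch of the metrics by hand:* To make the report above less opaque, the same dummy arrays can be scored manually. This sketch recomputes the confusion-matrix counts, precision, recall, and a pairwise AUC with plain NumPy. The helper name `pairwise_auc` is made up for illustration, and the snippet is not meant to replace scikit-learn.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "y_true = np.array([0, 0, 1, 1])      # 0=Truth, 1=Lie\n",
    "y_pred = np.array([0, 1, 0, 1])\n",
    "y_scores = np.array([0.1, 0.9, 0.4, 0.8])\n",
    "\n",
    "# Confusion-matrix counts, treating 'Lie' as the positive class\n",
    "tp = int(np.sum((y_true == 1) & (y_pred == 1)))\n",
    "fp = int(np.sum((y_true == 0) & (y_pred == 1)))\n",
    "fn = int(np.sum((y_true == 1) & (y_pred == 0)))\n",
    "tn = int(np.sum((y_true == 0) & (y_pred == 0)))\n",
    "\n",
    "precision = tp / (tp + fp)\n",
    "recall = tp / (tp + fn)\n",
    "\n",
    "def pairwise_auc(labels, scores):\n",
    "    '''Fraction of (Lie, Truth) pairs where the Lie instance scores higher; ties count half.'''\n",
    "    pos = scores[labels == 1]\n",
    "    neg = scores[labels == 0]\n",
    "    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)\n",
    "    return float(wins) / (len(pos) * len(neg))\n",
    "\n",
    "auc = pairwise_auc(y_true, y_scores)\n",
    "print(tp, fp, fn, tn, precision, recall, auc)\n",
    "```\n",
    "\n",
    "On this dummy data every count is 1 and precision, recall, and AUC all come out at 0.5, which is what a coin flip would achieve."
    ]
    },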
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Bias and Fairness\n",
    "It's crucial to assess the model's performance across different demographic or situational subsets to ensure fairness. For example, we should check if the system is equally accurate for people of different genders, ethnicities, dialects, etc. If we notice performance gaps, techniques like re-balancing the training data or algorithmic fairness adjustments (e.g., using IBM's AIF360 toolkit) can help.\n",
    "\n",
    "We also test the system's robustness:\n",
    "- Try intentionally noisy or low-quality inputs (blurry video, loud background noise in audio) to see if the system still performs reasonably.\n",
    "- Ensure that the system fails gracefully (perhaps by increasing uncertainty) rather than giving confident false outputs when data is poor.\n",
    "\n",
    "By conducting these tests, we aim to catch issues like overfitting, bias, or instability early and address them before deployment."
    ]
    },
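    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "*Sketch of a per-group accuracy check:* One simple way to look for the performance gaps described above is to compute accuracy separately for each subgroup. The helper `accuracy_by_group` and the group tags `'A'` and `'B'` below are hypothetical, and the labels are made-up numbers used only to show the mechanics.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "def accuracy_by_group(y_true, y_pred, groups):\n",
    "    '''Return {group: accuracy} so that gaps between groups are easy to spot.'''\n",
    "    correct = defaultdict(int)\n",
    "    total = defaultdict(int)\n",
    "    for truth, pred, group in zip(y_true, y_pred, groups):\n",
    "        total[group] += 1\n",
    "        correct[group] += int(truth == pred)\n",
    "    return {g: correct[g] / total[g] for g in total}\n",
    "\n",
    "# Made-up labels and group tags, for illustration only\n",
    "y_true = [0, 1, 1, 0, 1, 0]\n",
    "y_pred = [0, 1, 0, 0, 0, 1]\n",
    "groups = ['A', 'A', 'A', 'B', 'B', 'B']\n",
    "\n",
    "per_group = accuracy_by_group(y_true, y_pred, groups)\n",
    "gap = max(per_group.values()) - min(per_group.values())\n",
    "print(per_group, gap)\n",
    "```\n",
    "\n",
    "A large gap is a prompt for further investigation, not proof of bias on its own; with small subgroups the estimate is noisy."
    ]
    },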
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 7. Ethical Considerations & Responsible AI Use\n",
    "\n",
    "Implementing a lie detection system raises serious ethical and legal questions. We must address these to use the technology responsibly:\n",
    "\n",
    "- **Accuracy and Consequences**: No lie detector is 100% accurate. False positives (labeling truthful people as liars) can cause unjust harm, and false negatives (missing a lie) can be security risks. Thus, our system provides confidence scores and flags uncertain cases rather than making absolute judgments. A human should always double-check important decisions.\n",
    "\n",
    "- **Bias and Fairness**: AI models can inadvertently be biased. If the training data isn't diverse, the system might be less accurate for certain groups (e.g., due to differences in facial expressions or speech patterns across cultures). We must strive to train on diverse data and test for bias. As one EU politician noted regarding AI lie detectors: *\"It will discriminate against anyone who is disabled or who has an anxious personality. It will not work.\"* We must be vigilant that our system does not unfairly target certain traits or communities.\n",
    "\n",
    "- **Privacy**: By nature, this system analyzes personal and biometric data (faces, voices, physiological signals). Under privacy laws like GDPR, such data is highly sensitive. We should obtain informed consent from subjects, ensure data is securely stored or processed locally, and allow individuals to opt out. Only the data necessary for the analysis should be collected, and it should be deleted after use unless the subject has explicitly consented to storage.\n",
    "\n",
    "- **Legal Compliance**: In some jurisdictions, using AI for lie detection (especially in law enforcement or hiring) could be regulated or even prohibited. The EU AI Act, for example, classifies \"emotion recognition\" systems as high-risk. Deployers must ensure they follow all relevant laws and regulations. Also, this system should complement human judgment, not replace it. For critical decisions (like criminal investigations), AI output should not be the sole evidence.\n",
    "\n",
    "- **Pseudoscience and Limitations**: The scientific community is still debating how effective AI is at detecting deception. Some critics call these systems \"pseudoscience\" if claimed to be foolproof. We acknowledge that this tool has limitations and should not be considered a magical truth machine. It's an assistive tool that highlights potential signs of deceit, which a human expert must interpret with caution. Transparency about the system's accuracy and caveats is essential.\n",
    "\n",
    "- **Ethical Use Policies**: Anyone deploying such a system should have clear policies: when it is appropriate to use (and when not), who has access to the results, and how to ensure accountability. Logs of the agent's reasoning and human interventions should be kept (for example, to audit decisions). Users of the system should be trained in understanding its outputs and the uncertainty involved. Ultimately, the goal is to aid truth-finding, not to unfairly accuse innocent people or violate privacy.\n",
    "\n",
    "By considering these factors, we aim to develop and deploy the lie detection system in a way that is **fair, transparent, and accountable**. Responsible AI use isn't just a final step – it's a continuous process of monitoring and improving the system in the real world."
    ]
    }
    ]
    {
    "cell_type": "markdown",
    "id": "7c1c0192",
    "metadata": {},
    "source": [
    "## 1. Installation & Setup\n",
    "In this section, we install all required libraries and set up the environment.\n",
    "We'll use `pip` to install necessary packages and mount Google Drive to access datasets like the **Strawberry-Phi** deception dataset.\n",
    "\n",
    "#### Dependencies:\n",
    "- `torch` for deep learning model implementation (CNNs, LSTMs, transformers).\n",
    "- `transformers` for the text model and NLP tasks.\n",
    "- `opencv-python` for video processing (facial cues from images).\n",
    "- `librosa` for audio signal processing (extracting voice features).\n",
    "- `shap` and `lime` for explainable AI (interpret model decisions).\n",
    "- `scikit-learn` for evaluation metrics and possibly simple model components.\n",
    "- `ipywidgets` for interactive UI elements (uploading files, toggling options).\n",
    "\n",
    "We'll also mount Google Drive to load the **Strawberry-Phi** dataset for fine-tuning later."
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "47f5d9af",
    "metadata": {
    "tags": [
    "hide-output"
    ]
    },
    "outputs": [],
    "source": [
    "!pip install torch transformers opencv-python librosa shap lime scikit-learn ipywidgets\n",
    "\n",
    "# Mount Google Drive (only when running in Colab; skipped elsewhere)\n",
    "try:\n",
    "    from google.colab import drive\n",
    "    drive.mount('/content/drive')\n",
    "except ImportError:\n",
    "    print('Not running in Colab; skipping Google Drive mount.')"
    ]
    },
    {
    "cell_type": "markdown",
    "id": "4013300f",
    "metadata": {},
    "source": [
    "## 2. Project Overview\n",
    "**Multi-Modal Deception Detection** involves analyzing multiple data streams (like facial expressions, voice, text, and physiological signals) to determine if a subject is being deceptive. By combining modalities, we can improve accuracy since deceit often manifests through subtle cues in different channels.\n",
    "\n",
    "**ReAct Reasoning Framework**: The ReAct (Reason + Act) framework interleaves logical reasoning with actionable operations. Instead of making predictions blindly, the system generates a reasoning trace (chain-of-thought) and uses that to inform its actions. This combined approach has been shown to improve decision-making and interpretability. In practice, the agent will reason about the inputs (e.g., \"The subject is fidgeting and voice pitch is high, which often indicates stress\") and take actions (e.g., flag as potential lie) in a loop.\n",
    "\n",
    "We also integrate **GSPO (Generative Self-Play Optimization)** with ReAct. GSPO uses self-play reinforcement learning: the model can simulate conversations or scenarios with itself to improve its lie-detection policy over time. This optional module lets the system learn from hypothetical scenarios, gradually refining its decision boundaries.\n",
    "\n",
    "#### Ethical AI Considerations:\n",
    "- **Transparency**: Our system provides reasoning traces and uses explainability tools (LIME, SHAP) so users can understand *why* a decision was made, addressing the \"lack of explainability\" concern in AI lie detection.\n",
    "- **Bias Mitigation**: We must ensure the models do not overfit to demographic features (e.g., avoiding predictions based on gender or ethnicity). Training on diverse data and testing for bias helps create fair outcomes.\n",
    "- **Privacy**: All processing is done locally (no data is sent to external servers). We avoid storing sensitive personal data and only use the inputs for real-time analysis.\n",
    "- **Responsible Use**: Lie detection AI can be misused. This notebook is for research and educational purposes. Any real-world deployment should comply with legal standards and consider the potential for false positives/negatives.\n"
    ]
    },
    {
    "cell_type": "markdown",
    "id": "c85d16a4",
    "metadata": {},
    "source": [
    "## 3. Model Implementations\n",
    "We implement separate models for each modality. Each model outputs a confidence score or decision about deception for its modality. Later, we'll fuse these results.\n",
    "\n",
    "The models will be simple prototypes (not fully trained) to illustrate the architecture:\n",
    "- **Vision Model**: A CNN for facial expression and micro-expression analysis from video frames or images.\n",
    "- **Audio Model**: An LSTM (or GRU) for vocal analysis, capturing stress or pitch anomalies in speech.\n",
    "- **Text Model**: A Transformer (e.g., BERT) for analyzing textual statements for linguistic cues of deception.\n",
    "- **Physiological Model (Optional)**: Placeholder for processing signals like heart rate or skin conductance.\n"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "a577b2d2",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Vision Model: CNN-based facial analysis\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "class VisionCNN(nn.Module):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        # Simple CNN: 2 conv layers + FC\n",
    "        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)\n",
    "        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)\n",
    "        self.pool = nn.MaxPool2d(2, 2)\n",
    "        # Assuming input images are 64x64, after 2 pools -> 16x16\n",
    "        self.fc1 = nn.Linear(32 * 16 * 16, 2) # output: [lie_score, truth_score]\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = self.pool(F.relu(self.conv1(x)))\n",
    "        x = self.pool(F.relu(self.conv2(x)))\n",
    "        x = x.view(x.size(0), -1)\n",
    "        x = self.fc1(x)\n",
    "        return x\n",
    "\n",
    "# Instantiate the vision model (untrained for now)\n",
    "vision_model = VisionCNN()\n",
    "print(vision_model)"
    ]
    },
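    {
    "cell_type": "markdown",
    "id": "sketch-conv-shape",
    "metadata": {},
    "source": [
    "*Sketch of the shape arithmetic:* The `32 * 16 * 16` input size of `fc1` follows from the standard output-size formula for convolution and pooling layers. The helper `conv_out` below is a hypothetical name used only to walk through that arithmetic for a 64x64 input; it does not touch the model itself.\n",
    "\n",
    "```python\n",
    "def conv_out(size, kernel, stride=1, padding=0):\n",
    "    '''Spatial output size of a conv or pooling layer (floor division, no dilation).'''\n",
    "    return (size + 2 * padding - kernel) // stride + 1\n",
    "\n",
    "size = 64\n",
    "size = conv_out(size, kernel=3, stride=1, padding=1)  # conv1 keeps 64\n",
    "size = conv_out(size, kernel=2, stride=2)             # first pool halves to 32\n",
    "size = conv_out(size, kernel=3, stride=1, padding=1)  # conv2 keeps 32\n",
    "size = conv_out(size, kernel=2, stride=2)             # second pool halves to 16\n",
    "\n",
    "flat_features = 32 * size * size  # 32 output channels from conv2\n",
    "print(size, flat_features)\n",
    "```\n",
    "\n",
    "If the input resolution changes, the same calculation gives the new `in_features` value that `fc1` would need."
    ]
    },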
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "6087ded2",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Audio Model: LSTM-based vocal stress analysis\n",
    "import numpy as np\n",
    "import torch.nn.utils.rnn as rnn_utils\n",
    "\n",
    "class AudioLSTM(nn.Module):\n",
    "    def __init__(self, input_size=13, hidden_size=32, num_layers=1):\n",
    "        super().__init__()\n",
    "        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)\n",
    "        self.fc = nn.Linear(hidden_size, 2) # 2 classes: lie or truth\n",
    "\n",
    "    def forward(self, x, lengths=None):\n",
    "        # x: batch of sequences (batch, seq_len, features)\n",
    "        # lengths: optional 1-D LongTensor holding the true length of each sequence\n",
    "        if lengths is not None:\n",
    "            # pack padded sequence if lengths provided (packing expects lengths on the CPU)\n",
    "            x = rnn_utils.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)\n",
    "        lstm_out, _ = self.lstm(x)\n",
    "        if lengths is not None:\n",
    "            lstm_out, _ = rnn_utils.pad_packed_sequence(lstm_out, batch_first=True)\n",
    "        # Take output of last valid time step\n",
    "        if lengths is not None:\n",
    "            idx = (lengths - 1).view(-1, 1, 1).expand(lstm_out.size(0), 1, lstm_out.size(2))\n",
    "            last_outputs = lstm_out.gather(1, idx.to(lstm_out.device)).squeeze(1)\n",
    "        else:\n",
    "            last_outputs = lstm_out[:, -1, :]\n",
    "        out = self.fc(last_outputs)\n",
    "        return out\n",
    "\n",
    "# Instantiate the audio model (untrained placeholder)\n",
    "audio_model = AudioLSTM()\n",
    "print(audio_model)"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "bcd6bc3a",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Text Model: Transformer-based deception analysis\n",
    "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
    "import torch\n",
    "import torch.nn.functional as F\n",
    "\n",
    "# We use a pre-trained BERT encoder with a new 2-class head (truth/lie).\n",
    "# The head is randomly initialized, so outputs are not meaningful until the model is fine-tuned.\n",
    "model_name = 'bert-base-uncased'\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "text_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)\n",
    "\n",
    "# Function to get prediction from text model (accepts a string or a list of strings)\n",
    "def text_model_predict(text):\n",
    "    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)\n",
    "    with torch.no_grad():  # inference only, no gradients needed\n",
    "        outputs = text_model(**inputs)\n",
    "    logits = outputs.logits\n",
    "    probs = F.softmax(logits, dim=1)\n",
    "    # probs is a tensor of shape (batch_size, 2)\n",
    "    prob_np = probs.detach().cpu().numpy()\n",
    "    return prob_np\n",
    "\n",
    "# Example usage (with dummy text)\n",
    "example_text = \"I absolutely did not take the money.\" # a deceptive statement example\n",
    "probs = text_model_predict([example_text])\n",
    "print(f\"Predicted probabilities (lie/truth) for example text: {probs}\")"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "0af87b99",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Physiological Model (Optional): Placeholder for biometric data analysis\n",
    "# Example of physiological signals: heart rate, skin conductance, blood pressure, etc.\n",
    "# We'll create a simple placeholder class that could be extended for real sensor input.\n",
    "\n",
    "class PhysiologicalModel:\n",
    "    def __init__(self):\n",
    "        # No actual model, just a placeholder\n",
    "        self.name = 'PhysioModel'\n",
    "\n",
    "    def predict(self, data):\n",
    "        # data could be a dictionary of sensor readings\n",
    "        # Here we return a dummy neutral prediction\n",
    "        return np.array([0.5, 0.5]) # equal probability of lie/truth\n",
    "\n",
    "physio_model = PhysiologicalModel()\n",
    "print(\"Physiological model ready (placeholder):\", physio_model.name)"
    ]
    },
    {
    "cell_type": "markdown",
    "id": "bd3fe080",
    "metadata": {},
    "source": [
    "## 4. GSPO Integration\n",
    "Here we integrate **Generative Self-Play Optimization (GSPO)** to enhance the model's decision-making through reinforcement learning. In GSPO, the system can create simulated scenarios and learn from them (like an agent playing against itself to improve skill).\n",
    "\n",
    "- **Self-Play Reinforcement Learning**: The model (as an agent) plays both roles in a deception scenario (questioner and responder). For example, it might simulate asking a question and then answering either truthfully or deceptively. The agent then tries to predict deception on these simulated answers, receiving a reward for correct detection. Over many iterations, this self-play helps the agent refine its policy for detecting lies.\n",
    "- This approach is inspired by how game-playing AIs train via self-play (e.g., AlphaGo Zero using self-play to surpass human performance). It allows the model to explore a wide range of scenarios beyond the initial dataset.\n",
    "\n",
    "- **Optional Learning Toggle**: We implement GSPO in a modular way. Users can turn this self-play learning on or off (for example, to compare performance with/without reinforcement learning). By default, the system won't do self-play unless explicitly enabled, to avoid long training times in this demo.\n",
    "\n",
    "- **Fine-Tuning with Strawberry-Phi Dataset**: We incorporate a fine-tuning phase using the `strawberry-phi` dataset, which is assumed to contain recorded deception instances (possibly multi-modal). Fine-tuning on real or richly simulated data like Strawberry-Phi ensures the models align better with actual deception cues.\n"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "228f6b87",
    "metadata": {},
    "outputs": [],
    "source": [
    "# GSPO Self-Play Reinforcement Learning (simplified simulation)\n",
    "import random\n",
    "\n",
    "class SelfPlayAgent:\n",
    "    def __init__(self, detector_model):\n",
    "        self.model = detector_model # could be a combined model or policy\n",
    "        self.learning = False\n",
    "        self.training_history = []\n",
    "\n",
    "    def enable_learning(self, flag=True):\n",
    "        self.learning = flag\n",
    "\n",
    "    def simulate_scenario(self):\n",
    "        \"\"\"Simulate a deception scenario. Returns (input_data, is_deceptive).\"\"\"\n",
    "        # For simplicity, random simulation: generate a random outcome\n",
    "        # In practice, this could use a generative model to create realistic scenarios\n",
    "        is_deceptive = random.choice([0, 1]) # 0 = truth, 1 = lie\n",
    "        simulated_data = {\n",
    "            'video': None, # no actual video in this simulation\n",
    "            'audio': None,\n",
    "            'text': \"simulated statement\",\n",
    "            'physio': None\n",
    "        }\n",
    "        return simulated_data, is_deceptive\n",
    "\n",
    "    def train_self_play(self, episodes=5):\n",
    "        if not self.learning:\n",
    "            print(\"Self-play learning is disabled. Skipping training.\")\n",
    "            return\n",
    "        for ep in range(episodes):\n",
    "            data, truth_label = self.simulate_scenario()\n",
    "            # Here we would run the detection model on the simulated data\n",
    "            # and get a prediction (e.g., 1 for lie, 0 for truth)\n",
    "            # We'll simulate prediction randomly for this demo:\n",
    "            pred_label = random.choice([0, 1])\n",
    "            reward = 1 if pred_label == truth_label else -1\n",
    "            # In a real scenario, use this reward to update model (e.g., policy gradient)\n",
    "            self.training_history.append(reward)\n",
    "            print(f\"Episode {ep+1}: truth={truth_label}, pred={pred_label}, reward={reward}\")\n",
    "\n",
    "# Initialize a self-play agent (using text model as base for simplicity)\n",
    "agent = SelfPlayAgent(text_model)\n",
    "agent.enable_learning(flag=False) # Disabled by default\n",
    "agent.train_self_play(episodes=3)"
    ]
    },
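    {
    "cell_type": "markdown",
    "id": "sketch-reinforce",
    "metadata": {},
    "source": [
    "*Sketch of a reward-driven update:* The self-play cell above only records rewards; it never updates a model. As one possible way to close that loop, the sketch below applies a REINFORCE-style policy-gradient update to a two-parameter logistic policy on a synthetic one-dimensional cue. Everything here (the simulated cue, the function `train_policy`, the learning rate) is an assumption made for illustration and is not the notebook's GSPO implementation.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def train_policy(episodes=2000, lr=0.1, seed=0):\n",
    "    '''REINFORCE on a toy task: guess lie (1) or truth (0) from one noisy cue.'''\n",
    "    rng = np.random.default_rng(seed)\n",
    "    w, b = 0.0, 0.0\n",
    "    for _ in range(episodes):\n",
    "        label = int(rng.integers(0, 2))               # 0 = truth, 1 = lie\n",
    "        x = rng.normal(1.0 if label else -1.0, 1.0)   # synthetic cue, higher on average when lying\n",
    "        z = np.clip(w * x + b, -30.0, 30.0)\n",
    "        p_lie = 1.0 / (1.0 + np.exp(-z))              # policy: P(action = lie | x)\n",
    "        action = int(rng.random() < p_lie)            # sample an action from the policy\n",
    "        reward = 1.0 if action == label else -1.0\n",
    "        # For a Bernoulli-logistic policy, grad of log pi(action | x) is (action - p_lie) * [x, 1]\n",
    "        w += lr * reward * (action - p_lie) * x\n",
    "        b += lr * reward * (action - p_lie)\n",
    "    return w, b\n",
    "\n",
    "w, b = train_policy()\n",
    "print(f'learned weight: {w:.2f}, bias: {b:.2f}')\n",
    "```\n",
    "\n",
    "Because the cue is larger on average for lies, the learned weight should end up positive. A real detector would replace the two scalars with the model's parameters and the synthetic cue with features from simulated scenarios."
    ]
    },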
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "4615c03c",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Fine-tuning with Strawberry-Phi dataset (placeholder)\n",
    "import pandas as pd\n",
    "phi_data = None\n",
    "try:\n",
    "    # Attempt to load JSONL\n",
    "    phi_data = pd.read_json('/content/drive/MyDrive/strawberry-phi.jsonl', lines=True)\n",
    "except Exception:\n",
    "    try:\n",
    "        # Fall back to Parquet\n",
    "        phi_data = pd.read_parquet('/content/drive/MyDrive/strawberry-phi.parquet')\n",
    "    except Exception:\n",
    "        print(\"Strawberry-Phi dataset not found. Please upload it to Google Drive.\")\n",
    "\n",
    "if phi_data is not None:\n",
    "    print(\"Strawberry-Phi data loaded. Rows:\", len(phi_data))\n",
    "    # TODO: process the dataset, e.g., extract features, train models\n",
    "else:\n",
    "    print(\"Proceeding without Strawberry-Phi fine-tuning.\")"
    ]
    },
    {
    "cell_type": "markdown",
    "id": "8660904a",
    "metadata": {},
    "source": [
    "## 5. Fusion Model\n",
    "After obtaining results from each modality-specific model, we need to combine them into a final decision. This is handled by a **Fusion Model** or strategy.\n",
    "\n",
    "Common fusion approaches:\n",
    "- **Majority Voting**: Each modality votes truth or lie, and the majority wins. This is simple and robust to one model's errors.\n",
    "- **Weighted Ensemble**: Assign weights to each modality based on confidence or accuracy, then compute a weighted sum of lie probabilities.\n",
    "- **Learned Fusion (Meta-Model)**: Train a separate classifier that takes each model's output (or confidence) as input features and outputs the final decision. This could be a small neural network or logistic regression trained on a validation set.\n",
    "\n",
    "For our system, we'll implement a simple weighted approach. We assume each model outputs a probability of deception (lie). We'll average these probabilities (or give higher weight to modalities we trust more) and then apply a threshold.\n"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "f8000823",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Fusion function for combining modality outputs\n",
    "def fuse_outputs(results, weights=None):\n",
    "    \"\"\"\n",
    "    results: list of per-modality outputs, each either a dict with a 'lie' or 'lie_score' key\n",
    "             or a bare lie probability.\n",
    "    weights: optional list of weights for each modality.\n",
    "    returns: final decision ('lie' or 'truth') and combined score.\n",
    "    \"\"\"\n",
    "    if weights is None:\n",
    "        weights = [1] * len(results)\n",
    "    total_weight = sum(weights)\n",
    "    # weighted sum of lie probabilities\n",
    "    combined_score = 0.0\n",
    "    for res, w in zip(results, weights):\n",
    "        if isinstance(res, dict):\n",
    "            # Look keys up explicitly so a legitimate probability of 0.0 is not skipped\n",
    "            lie_prob = res['lie'] if 'lie' in res else res['lie_score']\n",
    "        else:\n",
    "            lie_prob = res\n",
    "        combined_score += w * float(lie_prob)\n",
    "    combined_score /= total_weight\n",
    "    decision = 'lie' if combined_score >= 0.5 else 'truth'\n",
    "    return decision, combined_score\n",
    "\n",
    "# Example: fuse dummy outputs from the models\n",
    "vision_out = {'lie': 0.7, 'truth': 0.3}\n",
    "audio_out = {'lie': 0.4, 'truth': 0.6}\n",
    "text_out = {'lie': 0.9, 'truth': 0.1}\n",
    "physio_out = {'lie': 0.5, 'truth': 0.5}\n",
    "final_decision, score = fuse_outputs([vision_out, audio_out, text_out, physio_out])\n",
    "print(f\"Final decision: {final_decision} (lie probability = {score:.2f})\")"
    ]
    },
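    {
    "cell_type": "markdown",
    "id": "sketch-majority-vote",
    "metadata": {},
    "source": [
    "*Sketch of majority voting:* For comparison with the weighted average above, the majority-voting strategy from the list of fusion approaches could look like the following. The function name `majority_vote` and the rule that ties go to 'truth' are assumptions made for this illustration.\n",
    "\n",
    "```python\n",
    "def majority_vote(lie_probs, threshold=0.5):\n",
    "    '''Each modality votes lie or truth at the threshold; tied vote counts go to truth.'''\n",
    "    lie_votes = sum(p >= threshold for p in lie_probs)\n",
    "    truth_votes = len(lie_probs) - lie_votes\n",
    "    decision = 'lie' if lie_votes > truth_votes else 'truth'\n",
    "    return decision, lie_votes, truth_votes\n",
    "\n",
    "# Same dummy lie probabilities as the weighted example: vision, audio, text, physio\n",
    "print(majority_vote([0.7, 0.4, 0.9, 0.5]))\n",
    "```\n",
    "\n",
    "Unlike the weighted average, voting ignores how confident each modality is, so a 0.51 and a 0.99 count the same. That makes it robust to one badly calibrated model but less sensitive overall."
    ]
    },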
    {
    "cell_type": "markdown",
    "id": "85e09344",
    "metadata": {},
    "source": [
    "## 6. ReAct Agent\n",
    "The ReAct agent is responsible for the reasoning-action loop. It should mimic how an expert would analyze evidence step-by-step, and justify each conclusion with reasoning before making the next move (action). Our ReAct agent will use the outputs from the above models and reason about them interactively.\n",
    "\n",
    "Key aspects of our ReAct implementation:\n",
    "- The agent will gather observations from each modality (e.g., *\"Vision model sees nervous facial expression.\"*).\n",
    "- It will reason about these observations (*\"Nervous face + high voice pitch = likely stress from lying\"*).\n",
    "- Based on this reasoning, it chooses an action: conclude \"lie\" or \"truth\", or ask for more input when the evidence is inconclusive.\n",
    "- The loop continues if more reasoning or data is needed. For simplicity, our agent will do one pass of reasoning and then decide.\n",
    "\n",
    "The agent's decision-making process (as pseudocode):\n",
    "1. **Observe**: Get inputs from modalities.\n",
    "2. **Reason**: Form a narrative like \"The text content contradicts known facts and the speaker's voice is shaky.\"\n",
    "3. **Act**: Decide on an output (lie or truth) or ask for more data if needed.\n",
    "4. **Explain**: Provide the reasoning trace to the user for transparency.\n"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "e8391df5",
    "metadata": {},
    "outputs": [],
    "source": [
    "# ReAct Agent Implementation (simplified reasoning loop)\n",
    "import random  # placeholder scores for modalities without a real model in this demo\n",
    "\n",
    "def react_agent_decision(video=None, audio=None, text=None, physio=None):\n",
    "    reasoning_trace = []\n",
    "    modality_results = []\n",
    "    # 1. Observe from each modality if available\n",
    "    if video is not None:\n",
    "        # Use vision model to get lie probability\n",
    "        # (Here we simulate by random since we don't have actual video frames)\n",
    "        vision_prob = random.random()\n",
    "        modality_results.append({'lie': vision_prob, 'truth': 1-vision_prob})\n",
    "        reasoning_trace.append(f\"Vision analysis suggests lie probability {vision_prob:.2f}.\")\n",
    "    if audio is not None:\n",
    "        # Simulated score; no audio model is run here\n",
    "        audio_prob = random.random()\n",
    "        modality_results.append({'lie': audio_prob, 'truth': 1-audio_prob})\n",
    "        reasoning_trace.append(f\"Audio analysis suggests lie probability {audio_prob:.2f}.\")\n",
    "    if text is not None:\n",
    "        # Use text model\n",
    "        probs = text_model_predict([text]) # get [ [lie_prob, truth_prob] ]\n",
    "        lie_prob = float(probs[0][0])\n",
    "        modality_results.append({'lie': lie_prob, 'truth': float(probs[0][1])})\n",
    "        reasoning_trace.append(f\"Text analysis suggests lie probability {lie_prob:.2f} for the statement.\")\n",
    "    if physio is not None:\n",
    "        # Simulated score; no physiological model is run here\n",
    "        physio_prob = random.random()\n",
    "        modality_results.append({'lie': physio_prob, 'truth': 1-physio_prob})\n",
    "        reasoning_trace.append(f\"Physiological analysis suggests lie probability {physio_prob:.2f}.\")\n",
    "\n",
    "    if not modality_results:\n",
    "        return \"No input provided\", None\n",
    "    # 2. Reason: (In a more complex system, we could add additional logical rules or ask follow-up questions.)\n",
    "    if len(modality_results) > 1:\n",
    "        reasoning_trace.append(\"Combining all modalities to form a conclusion.\")\n",
    "    else:\n",
    "        reasoning_trace.append(\"Single modality provided, basing conclusion on that alone.\")\n",
    "\n",
    "    # 3. Act: fuse results to get final decision\n",
    "    decision, score = fuse_outputs(modality_results)\n",
    "    # score is the fused lie probability, not a confidence in the chosen label\n",
    "    reasoning_trace.append(f\"Final decision: {decision.upper()} (fused lie probability {score:.2f}).\")\n",
    "\n",
    "    # 4. Explain: return the reasoning trace together with the decision\n",
    "    return \"\\n\".join(reasoning_trace), decision\n",
    "\n",
    "# Example usage of ReAct agent:\n",
    "reasoning, decision = react_agent_decision(video=True, audio=True, text=\"I am telling the truth.\")\n",
    "print(\"Reasoning Trace:\\n\" + reasoning)\n",
    "print(\"Decision:\", decision)"
    ]
    },
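    {
    "cell_type": "markdown",
    "id": "a7c1e001",
    "metadata": {},
    "source": [
    "Section 6 notes that the agent may ask for more input when it is uncertain, but the simplified agent above always commits to a label. A minimal sketch of that missing branch follows. The helper name `decide_with_abstain` and the 0.4 to 0.6 uncertainty band are illustrative assumptions, not part of the notebook; scores inside the band are routed to review instead of being labelled."
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "a7c1e002",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Hypothetical helper: the name and the default band are assumptions for illustration\n",
    "def decide_with_abstain(lie_score, low=0.4, high=0.6):\n",
    "    \"\"\"Map a fused lie probability to 'truth', 'lie', or 'needs_review'.\"\"\"\n",
    "    if not 0.0 <= lie_score <= 1.0:\n",
    "        raise ValueError(\"lie_score must be a probability in [0, 1]\")\n",
    "    if lie_score >= high:\n",
    "        return 'lie'\n",
    "    if lie_score <= low:\n",
    "        return 'truth'\n",
    "    # Inside the band the evidence is treated as inconclusive\n",
    "    return 'needs_review'\n",
    "\n",
    "for s in (0.15, 0.5, 0.85):\n",
    "    print(s, '->', decide_with_abstain(s))"
    ]
    },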
    {
    "cell_type": "markdown",
    "id": "1329ce16",
    "metadata": {},
    "source": [
    "## 7. Interactive Features\n",
    "To make the system interactive, we include features that allow user input and involvement:\n",
    "\n",
    "- **File Uploads**: Users can upload video, audio, or text for analysis. We use `ipywidgets` to provide UI elements (like file upload buttons) in Colab.\n",
    "- **Human-in-the-loop Validation**: After the model makes a decision, the user can review the reasoning and provide feedback or corrections. For example, if the model is wrong, the user could label the instance, which could be logged for further training.\n",
    "- **Explainability Tools**: We integrate LIME and SHAP to explain model predictions. For example, LIME can highlight which words in the text most influenced the prediction, or SHAP can indicate which facial features contributed to the vision model's output.\n",
    "\n",
    "These features help users trust and verify the system's outputs, turning the detection process into a cooperative effort between AI and human.\n"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "1859e2e7",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Interactive widget for file upload\n",
    "import ipywidgets as widgets\n",
    "\n",
    "# Create upload widgets for video, audio, text\n",
    "video_upload = widgets.FileUpload(accept=\".mp4,.mov,.avi\", description=\"Upload Video\", multiple=False)\n",
    "audio_upload = widgets.FileUpload(accept=\".wav,.mp3\", description=\"Upload Audio\", multiple=False)\n",
    "text_input = widgets.Textarea(placeholder='Enter text to analyze', description='Text:')\n",
    "\n",
    "# Display widgets\n",
    "display(video_upload)\n",
    "display(audio_upload)\n",
    "display(text_input)\n",
    "\n",
    "# Button to trigger analysis\n",
    "analyze_button = widgets.Button(description=\"Analyze\")\n",
    "output_area = widgets.Output()\n",
    "\n",
    "def first_upload(upload_widget):\n",
    "    # ipywidgets 7 exposes .value as a dict keyed by filename; ipywidgets 8 uses a tuple of dicts\n",
    "    value = upload_widget.value\n",
    "    if not value:\n",
    "        return None\n",
    "    return list(value.values())[0] if isinstance(value, dict) else value[0]\n",
    "\n",
    "def on_analyze_clicked(b):\n",
    "    with output_area:\n",
    "        output_area.clear_output()\n",
    "        vid_file = first_upload(video_upload)\n",
    "        aud_file = first_upload(audio_upload)\n",
    "        txt = text_input.value if text_input.value else None\n",
    "        reasoning, decision = react_agent_decision(video=vid_file, audio=aud_file, text=txt)\n",
    "        print(\"Reasoning:\\n\" + reasoning)\n",
    "        print(\"Decision:\", decision)\n",
    "\n",
    "analyze_button.on_click(on_analyze_clicked)\n",
    "display(analyze_button)\n",
    "display(output_area)"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "765ecaf3",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Explainability Example with LIME (for text model)\n",
    "from lime.lime_text import LimeTextExplainer\n",
    "\n",
    "explainer = LimeTextExplainer(class_names=[\"Truth\", \"Lie\"])\n",
    "# We'll use the text model's predict function for probabilities\n",
    "if 'text_model_predict' in globals():\n",
    "    exp = explainer.explain_instance(\"I swear I didn't do it\",\n",
    "                                     lambda x: text_model_predict(x),\n",
    "                                     num_features=5)\n",
    "    # Display the explanation in notebook (as text)\n",
    "    explanation = exp.as_list()\n",
    "    print(\"Top influences for the text model prediction:\")\n",
    "    for word, score in explanation:\n",
    "        print(f\"{word}: {score:.3f}\")\n",
    "else:\n",
    "    print(\"Text model not available for explanation.\")"
    ]
    },
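    {
    "cell_type": "markdown",
    "id": "a7c1e003",
    "metadata": {},
    "source": [
    "The human-in-the-loop bullet in Section 7 says a reviewer's correction could be logged for further training. One way this might look is sketched below. The function `log_feedback`, the file name `feedback_log.jsonl`, and the record fields are all hypothetical choices rather than a format this notebook defines, and storing statements like this would need the consent discussed under privacy."
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "a7c1e004",
    "metadata": {},
    "outputs": [],
    "source": [
    "import json\n",
    "import time\n",
    "\n",
    "# Hypothetical helper: appends one reviewer correction per line (JSON Lines)\n",
    "def log_feedback(path, text, model_decision, human_label, reasoning=\"\"):\n",
    "    \"\"\"Append one feedback record to `path` and return it.\"\"\"\n",
    "    record = {\n",
    "        \"timestamp\": time.time(),\n",
    "        \"text\": text,\n",
    "        \"model_decision\": model_decision,\n",
    "        \"human_label\": human_label,\n",
    "        \"agrees\": model_decision == human_label,\n",
    "        \"reasoning\": reasoning,\n",
    "    }\n",
    "    with open(path, \"a\", encoding=\"utf-8\") as f:\n",
    "        f.write(json.dumps(record) + \"\\n\")\n",
    "    return record\n",
    "\n",
    "rec = log_feedback(\"feedback_log.jsonl\", \"I swear I didn't do it\", \"lie\", \"truth\")\n",
    "print(\"Logged. Model and reviewer agree:\", rec[\"agrees\"])"
    ]
    },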
    {
    "cell_type": "markdown",
    "id": "f85ffbf7",
    "metadata": {},
    "source": [
    "## 8. Inference & Real-Time Processing\n",
    "Now that we have the components in place, we can use the system for inference on new data. This can be done offline (one recorded input at a time) or in real time.\n",
    "\n",
    "For **real-time processing**, imagine a scenario like a live interview or interrogation. The system would continuously capture video frames and audio snippets, run them through the respective models, and update its deception probability in real-time. The ReAct agent can continuously reason over the new data.\n",
    "\n",
    "In this notebook setting, we'll simulate real-time processing by iterating through some data or using a loop with delays. In a real deployment, one could use threads or async processes to handle streaming data from a webcam and microphone.\n",
    "\n",
    "*Note:* Real-time use requires efficient processing and possibly hardware acceleration (GPU) to keep up with live data. There's also a need to smooth predictions over time to avoid jitter (e.g., using a rolling average of recent outputs).\n"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "4e15e160",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Simulated real-time processing\n",
    "import time\n",
    "\n",
    "# Suppose we have a list of incoming text segments (as an example of streaming data)\n",
    "streaming_texts = [\n",
    " \"Hello, I'm happy to talk to you.\",\n",
    " \"I have nothing to hide.\",\n",
    " \"(nervous laugh) Sure, ask me anything...\",\n",
    " \"I already told you everything I know.\"\n",
    "]\n",
    "\n",
    "print(\"Starting live analysis loop...\\n\")\n",
    "for segment in streaming_texts:\n",
    "    # Simulate delay as if processing streaming input\n",
    "    time.sleep(1)\n",
    "    reasoning, decision = react_agent_decision(text=segment)\n",
    "    print(f\"Input: {segment}\\nDecision: {decision.upper()}\\n\")"
    ]
    },
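    {
    "cell_type": "markdown",
    "id": "a7c1e005",
    "metadata": {},
    "source": [
    "The note in Section 8 suggests smoothing predictions with a rolling average to avoid jitter. A small sketch of that idea follows. The class name `RollingAverage`, the window of 3, and the sample probabilities are assumptions for illustration; in a live loop each value passed to `update` would be the fused lie probability for the latest segment."
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "a7c1e006",
    "metadata": {},
    "outputs": [],
    "source": [
    "from collections import deque\n",
    "\n",
    "class RollingAverage:\n",
    "    \"\"\"Mean of the most recent `window` lie probabilities.\"\"\"\n",
    "    def __init__(self, window=3):\n",
    "        # deque with maxlen drops the oldest value automatically\n",
    "        self.values = deque(maxlen=window)\n",
    "    def update(self, prob):\n",
    "        self.values.append(prob)\n",
    "        return sum(self.values) / len(self.values)\n",
    "\n",
    "smoother = RollingAverage(window=3)\n",
    "smoothed = [smoother.update(p) for p in [0.2, 0.9, 0.3, 0.4]]\n",
    "for value in smoothed:\n",
    "    print(f\"smoothed lie probability: {value:.2f}\")"
    ]
    },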
    {
    "cell_type": "markdown",
    "id": "de0440b8",
    "metadata": {},
    "source": [
    "## 9. Testing & Evaluation\n",
    "To ensure our system works as expected, we include testing and evaluation steps:\n",
    "\n",
    "- **Unit Tests**: We create simple tests for each component (e.g., check that the vision model outputs the correct shape, or the fusion function behaves correctly). In Python, one could use the `unittest` framework or simple `assert` statements for validation.\n",
    "- **Performance Evaluation**: If we have labeled test data, we can measure accuracy, F1-score, AUC, etc. Here we'll simulate predictions and compute a confusion matrix and classification report using scikit-learn.\n",
    "- **Fairness Assessments**: It's important to test the model for bias. If we had data tagged with demographics, we could check performance separately for each group to ensure consistency. We might also use techniques like counterfactual testing (e.g., swapping gender-specific words in text to see if prediction changes) to identify bias.\n"
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "8e1712b6",
    "metadata": {},
    "outputs": [],
    "source": [
    "# Simple Unit Test for Fusion Function\n",
    "assert fuse_outputs([{'lie':0.8,'truth':0.2}, {'lie':0.8,'truth':0.2}])[0] == 'lie', \"Fusion failed for obvious lie case\"\n",
    "assert fuse_outputs([{'lie':0.1,'truth':0.9}, {'lie':0.2,'truth':0.8}])[0] == 'truth', \"Fusion failed for obvious truth case\"\n",
    "print(\"Fusion function unit tests passed!\")\n",
    "\n",
    "# Simulated Performance Evaluation\n",
    "from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report\n",
    "# Simulate some ground truth labels and predictions (1=lie, 0=truth)\n",
    "y_true = [0, 0, 1, 1, 1, 0]\n",
    "y_pred = [0, 1, 1, 1, 0, 0]\n",
    "print(\"Accuracy:\", accuracy_score(y_true, y_pred))\n",
    "print(\"F1-score:\", f1_score(y_true, y_pred, average='binary'))\n",
    "print(\"Confusion Matrix:\\n\", confusion_matrix(y_true, y_pred))\n",
    "print(\"Classification Report:\\n\", classification_report(y_true, y_pred, target_names=[\"Truth\",\"Lie\"]))"
    ]
    },
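    {
    "cell_type": "markdown",
    "id": "a7c1e007",
    "metadata": {},
    "source": [
    "The fairness bullet in Section 9 proposes checking performance separately for each demographic group. A sketch of that check follows, using only the standard library. The helper `accuracy_by_group`, the group tags `'A'` and `'B'`, and the labels are made up for illustration; a real assessment would need consented, properly tagged data and more than accuracy alone."
    ]
    },
    {
    "cell_type": "code",
    "execution_count": null,
    "id": "a7c1e008",
    "metadata": {},
    "outputs": [],
    "source": [
    "from collections import defaultdict\n",
    "\n",
    "# Hypothetical helper: per-group accuracy so gaps between groups are easy to spot\n",
    "def accuracy_by_group(y_true, y_pred, groups):\n",
    "    \"\"\"Return {group: accuracy} for parallel lists of labels, predictions, and group tags.\"\"\"\n",
    "    correct = defaultdict(int)\n",
    "    total = defaultdict(int)\n",
    "    for t, p, g in zip(y_true, y_pred, groups):\n",
    "        total[g] += 1\n",
    "        correct[g] += int(t == p)\n",
    "    return {g: correct[g] / total[g] for g in total}\n",
    "\n",
    "# Made-up labels and group tags (1=lie, 0=truth), for illustration only\n",
    "y_true_demo = [0, 0, 1, 1, 1, 0]\n",
    "y_pred_demo = [0, 1, 1, 1, 0, 0]\n",
    "groups_demo = ['A', 'A', 'A', 'B', 'B', 'B']\n",
    "group_acc = accuracy_by_group(y_true_demo, y_pred_demo, groups_demo)\n",
    "print(group_acc)"
    ]
    },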
    {
    "cell_type": "markdown",
    "id": "777b0ba6",
    "metadata": {},
    "source": [
    "## 10. Ethical Considerations\n",
    "Building a lie detection system raises important ethical questions. We conclude by addressing these aspects:\n",
    "\n",
    "- **Privacy**: Deception detection can be very invasive. Video and audio analysis might reveal sensitive information. It's crucial to obtain informed consent from individuals being analyzed and ensure data is stored securely (or not at all, in our design).\n",
    "- **Bias and Fairness**: As noted earlier, AI models can inadvertently learn biases. For example, certain facial expressions might be more common in some cultures but not indicate lying. We should continuously test for and mitigate bias. Techniques include balanced training data, bias correction algorithms, and human review of contentious cases.\n",
    "- **False Accusations**: No lie detector is 100% accurate, and human judges are fallible too. AI predictions should not be taken as absolute truth. The system should express uncertainty (e.g., a confidence score) and allow for an appeal or secondary review process. The cost of wrongly accusing someone is high, so the threshold for calling something a lie should be chosen carefully.\n",
    "- **Legal Compliance**: Different jurisdictions have laws about recording conversations, biometric data use, and the admissibility of lie detection in court. Any deployment of this technology must comply with privacy laws (like GDPR) and regulations regarding such tools. Also, organizations like the APA have ethical guidelines on lie detection usage.\n",
    "- **Responsible Deployment**: We emphasize that this project is a prototype. In practice, one should involve ethicists, legal experts, and psychologists before using an AI lie detection system in real-world situations. It should augment human judgment, not replace it.\n",
    "\n",
    "By considering these factors, developers and users of lie detection AI can aim to minimize harm and maximize the benefits of the technology."
    ]
    }
    ],
    "metadata": {
    "kernelspec": {
    "display_name": "Python 3",
    "language": "python",
    "name": "python3"
    },
    "language_info": {
    "name": "python",
    "version": "3.9"
    }
    },
    "nbformat": 4,
    "nbformat_minor": 5
    }
  2. @ruvnet ruvnet revised this gist Feb 8, 2025. 1 changed file with 811 additions and 0 deletions.
    811 changes: 811 additions & 0 deletions notebook.ipynb
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,811 @@
    {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {
    "colab": {
    "name": "MultiModal_LieDetection_ReAct_Tutorial.ipynb"
    },
    "kernelspec": {
    "display_name": "Python 3",
    "name": "python3"
    }
    },
    "cells": [
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "# Multi-Modal Lie Detection with ReAct: A Step-by-Step Tutorial\n",
    "In this tutorial, we implement a multi-modal lie detection system that analyzes **vision**, **audio**, **text**, and optionally **physiological** signals. By using an agent-based approach called **ReAct** (Reasoning + Acting), the system can reason about its outputs and involve humans in the loop for validation. We will cover everything from installing requirements to evaluating performance, while emphasizing privacy and ethical use.\n",
    "\n",
    "**Overview**:\n",
    "- *Installation & Setup*: Prepare the environment (Google Colab and Drive integration).\n",
    "- *Project Overview*: Understand multi-modal deception detection and the ReAct reasoning framework.\n",
    "- *Model Implementations*: Build models for facial cues, vocal stress, and text analysis, optionally including physiological data, and combine them.\n",
    "- *Interactive Features*: Use widgets and user input to incorporate human feedback and explain model decisions.\n",
    "- *Inference & Real-Time Processing*: Run the lie detector on sample inputs (video, audio, text) and simulate real-time usage.\n",
    "- *Testing & Evaluation*: Verify model components with tests, and evaluate accuracy, precision/recall, AUC, etc.\n",
    "- *Ethical Considerations*: Address bias, privacy, legal compliance (e.g., GDPR, EU AI Act), and responsible deployment practices.\n",
    "\n",
    "Let's get started!"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 1. Installation & Setup\n",
    "\n",
    "First, we need to install the required libraries and set up our environment. This notebook is designed for **Google Colab** for ease of use. It will also demonstrate how to integrate with **Google Drive** if you want to save or load data (like videos or models).\n",
    "\n",
    "**Dependencies**:\n",
    "- `torch` (PyTorch) for building deep learning models.\n",
    "- `transformers` (HuggingFace) for NLP models.\n",
    "- `opencv-python` for image and video processing.\n",
    "- `librosa` for audio processing.\n",
    "- `shap` and `lime` for explainability (optional).\n",
    "- `ipywidgets` for interactive widgets.\n",
    "- `scikit-learn` for evaluation metrics (optional).\n",
    "\n",
    "We'll also ensure we have access to GPU (if available) for faster computations and mount Google Drive for data storage."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "!pip install transformers opencv-python librosa shap lime scikit-learn ipywidgets"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "*Note:* If using Google Drive to store or retrieve data, you can mount it here:"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "from google.colab import drive\n",
    "drive.mount('/content/drive')"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "import torch\n",
    "print(\"Torch version:\", torch.__version__)\n",
    "print(\"GPU available:\", torch.cuda.is_available())"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 2. Project Overview\n"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Multi-Modal Deception Detection\n",
    "Combining multiple data modalities can improve the accuracy of lie detection by capturing different cues. Traditional lie detection often relies on a single source like physiological signals (e.g., polygraph measurements), which is not very reliable. In a multi-modal system, we analyze **facial expressions**, **voice tone**, **spoken or written text**, and even **physiological sensors** together. Each modality may provide unique indicators of stress or deceit:\n",
    "- **Vision**: Micro-expressions, eye movements, and body language (e.g., fidgeting) could suggest discomfort associated with lying.\n",
    "- **Audio**: Changes in pitch, tone, speech rate, or hesitation in voice can be signs of stress.\n",
    "- **Text**: Linguistic cues such as choice of words, sentiment, or contradictions in a story might indicate deception.\n",
    "- **Physiological**: Heart rate, skin conductance (sweating), etc., can reflect nervousness.\n",
    "\n",
    "By fusing these signals, the system reduces uncertainty from any single source and makes a more informed judgment. Research has shown that integrating verbal and nonverbal cues improves detection performance compared to unimodal approaches."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### ReAct Reasoning and Agentic Decisions\n",
    "Instead of a black-box classifier, our system uses an **agent** that can reason about the inputs and its own outputs. We adopt the **ReAct (Reasoning + Acting)** framework, where the AI agent alternates between reasoning steps and actions. In practice, this means the model will:\n",
    "1. **Reason**: Internally analyze the evidence (e.g., *\"Facial cues suggest stress, but vocal analysis is moderate\"*).\n",
    "2. **Act**: Take an action based on that analysis (e.g., *decide to gather more information* or *flag for human review*).\n",
    "3. Repeat this reasoning-action loop, refining the decision with each step.\n",
    "\n",
    "This agentic approach allows the system to not only output a prediction (truth or lie) but also an explanation of how it arrived there. The agent can use **recursive decision-making** – revisiting its conclusions if new evidence or actions suggest something different – and even use simple **reinforcement learning** techniques to improve over time. For example, the agent could learn from mistakes (with human feedback) and adjust its strategy in future interactions."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Human-in-the-Loop and Privacy\n",
    "To make the system reliable and responsible, we include a **human-in-the-loop** at critical points. This means a human (e.g., an investigator or analyst) can:\n",
    "- Review cases where the AI is uncertain or the modalities disagree.\n",
    "- Override the AI's decision if it seems incorrect.\n",
    "- Provide feedback that the AI uses to improve (a form of supervised reinforcement learning on mistakes).\n",
    "\n",
    "For instance, if facial and audio cues conflict strongly, the system can automatically flag the interview for human review instead of making a hard judgment. We will see later how the notebook can prompt for human input in such cases.\n",
    "\n",
    "**Privacy Considerations**: Because this system deals with sensitive biometric data (faces, voice recordings, heart rates, etc.), it is designed with privacy in mind. Data can be processed **on-device** or in a secure environment to avoid sending personal data to external servers. Techniques like data anonymization and encryption are applied where possible. Under regulations like the **GDPR**, biometric data is considered highly sensitive and requires robust protection. Therefore, any real deployment must ensure user consent is obtained and that data storage complies with privacy laws. In our demo, all data stays local to your Colab session or Google Drive to respect privacy."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 3. Model Implementations\n",
    "\n",
    "Now, we will implement the models for each modality and then create a fusion mechanism and the ReAct-based agent. For simplicity, we'll use relatively simple models and simulated data (since training a full model here is beyond scope). The focus is on the architecture and how these components interact, rather than achieving state-of-the-art accuracy.\n",
    "\n",
    "We'll implement the following:\n",
    "- **Vision Model**: a CNN to analyze facial video frames.\n",
    "- **Audio Model**: an LSTM-based model to analyze speech.\n",
    "- **Text Model**: a Transformer-based or simplified model to analyze transcript text.\n",
    "- **Physiological Model** (optional): a placeholder for handling sensor data (if available).\n",
    "- **Fusion Model**: a strategy to combine outputs from all modalities.\n",
    "- **ReAct Agent**: an agent that uses the fused results and reasoning rules to decide lie/truth and produce an explanation.\n",
    "\n",
    "Let's proceed step-by-step through each component."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Vision Model (Facial Analysis)\n",
    "For the vision modality, we'll use a Convolutional Neural Network (CNN) to extract facial cues. The model could analyze facial expressions or micro-expressions from video frames. In practice, one might use a pre-trained model (like ResNet50) fine-tuned on emotion or expression datasets for subtle indicators of deceit. Here, we'll build a simple CNN from scratch for demonstration.\n",
    "\n",
    "**Approach**:\n",
    "- We assume video frames or images of the subject are available.\n",
    "- We preprocess each frame (resize, normalize) and feed it into the CNN.\n",
    "- The CNN outputs a probability distribution over two classes: \"Truth\" vs \"Lie\".\n",
    "- For example, a tense facial expression or avoidance of eye contact might push the prediction towards \"Lie\".\n",
    "\n",
    "We'll implement a small CNN with a couple of convolutional layers and a final output layer with 2 neurons (for the two classes). No training is performed here; we'll use random weights to illustrate the pipeline."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Vision Model Implementation (CNN)\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "class VisionModel(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(VisionModel, self).__init__()\n",
    "        # Simple CNN: conv layers followed by a fully connected layer\n",
    "        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1) # downsample by 2\n",
    "        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1) # downsample further\n",
    "        self.conv3 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)\n",
    "        self.fc = nn.Linear(32 * 8 * 8, 2) # assuming input frames 64x64 -> after 3 strides of 2 => 8x8 feature map\n",
    "    def forward(self, x):\n",
    "        x = F.relu(self.conv1(x))\n",
    "        x = F.relu(self.conv2(x))\n",
    "        x = F.relu(self.conv3(x))\n",
    "        x = x.view(x.size(0), -1)\n",
    "        x = self.fc(x)\n",
    "        # Output as probabilities for [Truth, Lie]\n",
    "        return torch.softmax(x, dim=1)\n",
    "\n",
    "# Instantiate the model and test on a dummy input\n",
    "vision_model = VisionModel()\n",
    "dummy_frame = torch.randn(1, 3, 64, 64) # batch of 1, 64x64 RGB image\n",
    "dummy_out = vision_model(dummy_frame)\n",
    "print(\"Vision model output (Truth,Lie probabilities):\", dummy_out.detach().numpy())"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Audio Model (Speech Analysis)\n",
    "For the audio modality, we analyze the speaker's voice. Signs of stress or deception can manifest as changes in vocal pitch, tone, pace, or disfluencies (ums, pauses). A common approach is to extract acoustic features (e.g., MFCCs, spectrograms) and use a sequence model to capture temporal patterns.\n",
    "\n",
    "We will implement an LSTM-based model that takes extracted features from the audio waveform and outputs a probability of truth/lie. In practice, one could use a pre-trained audio model like **Wav2Vec 2.0** for richer representations, but here we'll keep it simple:\n",
    "- Use `librosa` to extract MFCC features from an audio sample.\n",
    "- Feed the sequence of MFCC vectors into an LSTM.\n",
    "- Use the final LSTM output (or hidden state) to classify lie vs truth.\n",
    "\n",
    "This model should capture things like elevated pitch or irregular pauses which might correlate with lying."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Audio Model Implementation (LSTM)\n",
    "import torch.nn as nn\n",
    "\n",
    "class AudioModel(nn.Module):\n",
    "    def __init__(self, input_dim=13, hidden_dim=32):\n",
    "        super(AudioModel, self).__init__()\n",
    "        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)\n",
    "        self.fc = nn.Linear(hidden_dim, 2)\n",
    "    def forward(self, x):\n",
    "        # x shape: (batch, seq_len, input_dim)\n",
    "        lstm_out, (h, c) = self.lstm(x)\n",
    "        # Use last hidden state\n",
    "        last_hidden = h[-1] # shape (batch, hidden_dim)\n",
    "        out = self.fc(last_hidden)\n",
    "        return torch.softmax(out, dim=1)\n",
    "\n",
    "audio_model = AudioModel()\n",
    "# Generate a dummy audio feature sequence (e.g., 50 time steps of 13-dim MFCCs)\n",
    "dummy_audio = torch.randn(1, 50, 13)\n",
    "dummy_audio_out = audio_model(dummy_audio)\n",
    "print(\"Audio model output (Truth,Lie probabilities):\", dummy_audio_out.detach().numpy())"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Text Model (Language Analysis)\n",
    "The text modality examines what the person is saying (or writing). Linguistic patterns can reveal deception – liars might use fewer first-person pronouns, or add certain qualifying phrases, etc. Modern approaches use Transformer-based models like BERT or RoBERTa to classify text as truthful or deceptive.\n",
    "\n",
    "To keep things simple, we'll implement a placeholder text model. For demonstration, we might use a basic keyword-based heuristic or a simple logistic model. (In a real system, you would fine-tune a pretrained transformer on a deception dataset.)\n",
    "\n",
    "Our simplified text model will:\n",
    "- Take a transcript or statement as input (string).\n",
    "- Output a probability of lie/truth.\n",
    "- *(For demonstration, we'll use a trivial rule: if the statement contains negation words like \"not\", \"never\", we might lean towards \"lie\" to simulate detecting a denial. This is just a placeholder logic.)*"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Text Model Implementation (simplified)\n",
    "import re\n",
    "import numpy as np\n",
    "\n",
    "class TextModel:\n",
    "    def __init__(self):\n",
    "        # Example keywords indicative of deception (very naive approach):\n",
    "        self.deception_keywords = {\"not\", \"never\", \"didn't\", \"cannot\"}\n",
    "    def predict(self, text):\n",
    "        \"\"\"Return a probability tensor [p_truth, p_lie] based on the presence of keywords.\"\"\"\n",
    "        # Match whole words so that \"nothing\" or \"another\" does not count as \"not\"\n",
    "        words = set(re.findall(r\"[a-z']+\", text.lower()))\n",
    "        # Naive rule: if any deception keyword is present, assign higher lie probability\n",
    "        lie_prob = 0.7 if words & self.deception_keywords else 0.3\n",
    "        truth_prob = 1 - lie_prob\n",
    "        probs = torch.tensor([[truth_prob, lie_prob]])\n",
    "        return probs\n",
    "\n",
    "text_model = TextModel()\n",
    "# Test the text model with example inputs\n",
    "for example in [\"I was at home all evening.\", \"I did not take the money.\"]:\n",
    " out = text_model.predict(example)\n",
    " print(f\"Text: '{example}' -> Output (Truth,Lie):\", out.numpy())"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### (Optional) Physiological Model\n",
    "In some scenarios, we might have physiological data such as heart rate, skin conductance (GSR), or blood pressure. These signals can indicate stress levels (as used in a traditional polygraph). Integrating such data can provide additional clues to deception.\n",
    "\n",
    "For the scope of this tutorial, we will not implement a full physiological model, but here's how it could be handled:\n",
    "- If sensor data is available (e.g., a sequence of heart rate measurements during questioning), you could use a simple threshold model or a small neural network to detect anomalies.\n",
    "- For example, a sudden spike in heart rate or GSR could be interpreted as increased stress.\n",
    "- This model would output a probability of deception similar to the others.\n",
    "\n",
    "In our code, we'll assume we don't have this modality available. If you did, you would process it and include it in the fusion step just like the others."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Fusion Model (Integrating Modalities)\n",
    "After obtaining predictions from each modality, we need to combine them into a single decision. There are different fusion strategies​:contentReference[oaicite:22]{index=22}:\n",
    "- **Early Fusion**: combining raw features from all modalities and then classify (requires joint training).\n",
    "- **Late Fusion**: each modality gives an independent judgment (e.g., a probability of deception), and we combine those judgments (e.g., via averaging or a meta-classifier).\n",
    "- **Hybrid Fusion**: use a more complex model (like attention) to weight modalities dynamically​:contentReference[oaicite:23]{index=23}.\n",
    "\n",
    "We will implement a simple late fusion approach​:contentReference[oaicite:24]{index=24}: take the average of the \"lie\" probabilities from the vision, audio, and text models. This assumes each modality is equally important (which may not be true in all cases, but it's a simple and effective starting point).\n",
    "\n",
    "The fusion model will output a combined probability for truth/lie. We can then set a threshold (e.g., 0.5) on this combined probability to make the final classification."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Fusion function to combine modality outputs\n",
    "def fuse_predictions(predictions):\n",
    " \"\"\"\n",
    " Combine predictions from modalities.\n",
    " `predictions` is a list of [p_truth, p_lie] from each available modality.\n",
    " Returns a fused [p_truth, p_lie] list.\n",
    " \"\"\"\n",
    " preds = np.array(predictions)\n",
    " avg_probs = preds.mean(axis=0)\n",
    " # Ensure it sums to 1 (should already, if each pred is probabilities)\n",
    " avg_probs = avg_probs / avg_probs.sum()\n",
    " return avg_probs.tolist()\n",
    "\n",
    "# Example: fuse dummy outputs from the models\n",
    "vision_dummy = dummy_out.squeeze().tolist()\n",
    "audio_dummy = dummy_audio_out.squeeze().tolist()\n",
    "text_dummy = text_model.predict(\"Just a harmless example.\").squeeze().tolist()\n",
    "fused_dummy = fuse_predictions([vision_dummy, audio_dummy, text_dummy])\n",
    "print(\"Fused output (Truth,Lie probabilities):\", fused_dummy)"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### ReAct Agent (Reasoning and Action)\n",
    "Now we build the central **ReAct agent** that uses the outputs of all modalities and makes a final decision with reasoning. The agent will mimic a decision-making process:\n",
    "1. It looks at the inputs from each model (vision, audio, text, etc.).\n",
    "2. It generates a reasoning trace, e.g. notes if one modality strongly indicates \"lie\" while another indicates \"truth\".\n",
    "3. If there's disagreement or low confidence, it can decide to label the result as uncertain and ask for human input​:contentReference[oaicite:25]{index=25}.\n",
    "4. Otherwise, it makes a final call (truth or lie) and provides an explanation of how it reached that conclusion.\n",
    "\n",
    "In a real implementation, the agent could incorporate business rules or even a small reinforcement learning model to optimize its questioning strategy. We can also add a **neuro-symbolic** layer: for example, a rule like *\"If text content contradicts facial emotion, increase the deception probability\"*​:contentReference[oaicite:26]{index=26}.\n",
    "\n",
    "Our ReAct agent here will be rule-based for clarity:\n",
    "- If all modalities agree (all high lie or all low lie probability), take that as the decision.\n",
    "- If they conflict, the agent may either choose the majority or mark the result as \"Uncertain\" and suggest human review.\n",
    "- It will produce a reasoning log of its steps."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Agent Implementation using ReAct reasoning\n",
    "class LieDetectionAgent:\n",
    " def __init__(self, lie_threshold=0.5, conflict_threshold=0.2):\n",
    " \"\"\"\n",
    " lie_threshold: probability above which a modality votes 'Lie'.\n",
    " conflict_threshold: if difference between max and min lie probabilities is above this, flag conflict.\n",
    " \"\"\"\n",
    " self.lie_threshold = lie_threshold\n",
    " self.conflict_threshold = conflict_threshold\n",
    " \n",
    " def analyze(self, vision_pred, audio_pred, text_pred):\n",
    " \"\"\"\n",
    " Analyze the predictions from each modality.\n",
    " vision_pred, audio_pred, text_pred are each a list or tensor [p_truth, p_lie].\n",
    " Returns (final_decision, reasoning_trace).\n",
    " \"\"\"\n",
    " vision_pred = vision_pred if isinstance(vision_pred, list) else vision_pred.squeeze().tolist()\n",
    " audio_pred = audio_pred if isinstance(audio_pred, list) else audio_pred.squeeze().tolist()\n",
    " text_pred = text_pred if isinstance(text_pred, list) else text_pred.squeeze().tolist()\n",
    " modality_preds = {\n",
    " \"Vision\": vision_pred,\n",
    " \"Audio\": audio_pred,\n",
    " \"Text\": text_pred\n",
    " }\n",
    " reasoning_trace = []\n",
    " lie_probs = {}\n",
    " # Note each modality's lie probability\n",
    " for mod, pred in modality_preds.items():\n",
    " lie_prob = pred[1]\n",
    " lie_probs[mod] = lie_prob\n",
    " reasoning_trace.append(f\"{mod} model indicates lie probability = {lie_prob:.2f}.\")\n",
    " \n",
    " # Check for agreement or conflict\n",
    " max_mod = max(lie_probs, key=lie_probs.get)\n",
    " min_mod = min(lie_probs, key=lie_probs.get)\n",
    " max_prob = lie_probs[max_mod]\n",
    " min_prob = lie_probs[min_mod]\n",
    " if max_prob - min_prob > self.conflict_threshold:\n",
    " reasoning_trace.append(f\"High disagreement detected between modalities (range {min_prob:.2f}-{max_prob:.2f}).\")\n",
    " conflict = True\n",
    " else:\n",
    " conflict = False\n",
    " \n",
    " # Determine final decision based on average\n",
    " avg_lie_prob = sum(lie_probs.values()) / len(lie_probs)\n",
    " if avg_lie_prob > self.lie_threshold:\n",
    " final_decision = \"Lie\"\n",
    " else:\n",
    " final_decision = \"Truth\"\n",
    " \n",
    " reasoning_trace.append(f\"Average lie probability = {avg_lie_prob:.2f}, hence system verdict = '{final_decision}'.\")\n",
    " \n",
    " # If conflict, recommend human review\n",
    " if conflict:\n",
    " reasoning_trace.append(\"Modalities are inconsistent; flagging for human review.\")\n",
    " final_decision = final_decision + \" (Uncertain, needs human verification)\"\n",
    " \n",
    " return final_decision, reasoning_trace\n",
    "\n",
    "# Instantiate the agent\n",
    "agent = LieDetectionAgent()\n",
    "# Test agent with dummy predictions\n",
    "test_decision, test_trace = agent.analyze(vision_dummy, audio_dummy, text_dummy)\n",
    "print(\"Agent reasoning trace (demo):\")\n",
    "for line in test_trace:\n",
    " print(\"-\", line)\n",
    "print(\"Agent decision:\", test_decision)"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 4. Interactive Features\n",
    "\n",
    "Interactivity is key for a human-centered lie detection system. In this section, we'll discuss:\n",
    "- Uploading and processing user data (video, audio, text input).\n",
    "- Involving a human operator to validate or correct the AI's decisions.\n",
    "- Using explainability techniques to interpret model predictions."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Uploading Video/Audio/Text Data\n",
    "To test our system, we need to provide input data. In a Colab environment, you can upload files or use files stored in Google Drive.\n",
    "\n",
    "Below are examples of how to upload a video file and an audio file in Colab (you'll be prompted to choose files). Then we also take a text input as the transcript:"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "from google.colab import files\n",
    "\n",
    "# Upload a video file (e.g., .mp4)\n",
    "print(\"Please upload a video file for analysis:\")\n",
    "video_upload = files.upload()\n",
    "if video_upload:\n",
    " video_path = next(iter(video_upload))\n",
    " print(f\"Uploaded video: {video_path}\")\n",
    "\n",
    "# Upload an audio file (e.g., .wav)\n",
    "print(\"Please upload an audio file for analysis:\")\n",
    "audio_upload = files.upload()\n",
    "if audio_upload:\n",
    " audio_path = next(iter(audio_upload))\n",
    " print(f\"Uploaded audio: {audio_path}\")\n",
    "\n",
    "# Get text input (transcript)\n",
    "transcript = input(\"Enter the transcript or statement to analyze (or leave empty if not available): \")\n",
    "if transcript == \"\":\n",
    " transcript = \"No transcript provided.\"\n",
    " print(\"Using default text:\", transcript)\n",
    "else:\n",
    " print(\"Transcript received.\")"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Human-in-the-Loop Validation\n",
    "In a real deployment, whenever the AI system is unsure or just as a regular policy, a human should review the results. We can simulate this in the notebook. For instance, after the agent makes a prediction, we can ask the user (human) to confirm or correct it.\n",
    "\n",
    "We will integrate a step where the system's decision is presented, and the user can input whether they agree or if they want to override the decision. This could also be done with interactive widgets (like buttons or dropdowns) for a more user-friendly UI.\n",
    "\n",
    "### Explainability Tools\n",
    "To build trust, it's important to explain why the AI made a certain decision:\n",
    "- **SHAP and LIME** can highlight which features or words influenced the models' predictions.\n",
    "- **Grad-CAM** can show which regions of a video frame the CNN focused on when predicting \"lie\".\n",
    "- **Attention visualization** in transformers can show which words in the text were considered most important.\n",
    "\n",
    "For example, let's use LIME to explain the text model's decision for a sample input. We will see what words influence the model's output (remember, our text model is very simple, so this is just to demonstrate the process)."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Install and use LIME for explainability on text model\n",
    "!pip install --quiet lime\n",
    "from lime.lime_text import LimeTextExplainer\n",
    "\n",
    "# Define a predict function for our text model that LIME can call\n",
    "class_names = [\"Truth\", \"Lie\"]\n",
    "def text_model_predict(texts):\n",
    " results = []\n",
    " for t in texts:\n",
    " probs = text_model.predict(t).detach().numpy()[0]\n",
    " results.append(probs)\n",
    " return np.vstack(results)\n",
    "\n",
    "explainer = LimeTextExplainer(class_names=class_names)\n",
    "sample_text = \"Honestly, I did not steal anything.\"\n",
    "exp = explainer.explain_instance(sample_text, text_model_predict, num_features=6)\n",
    "print(\"LIME explanation for text:\\n\")\n",
    "for feature, weight in exp.as_list():\n",
    " print(f\"{feature}: {weight:.3f}\")"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "*Interpretation:* In the above output, LIME lists the words and their influence on the prediction. A positive weight indicates the word contributes to predicting \"Lie\", while a negative weight would support \"Truth\". We can see which keywords our simple model is relying on (for example, \"did\" or \"not\" might appear with positive weights since our model keys off negation). In a more advanced model, this helps identify important linguistic features."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 5. Inference & Real-Time Processing\n",
    "\n",
    "Now that all components are ready, let's run the lie detection system on some input data. We will use the data you provided (video, audio, text) in the previous step. The pipeline is:\n",
    "1. **Vision**: Read the video file, extract a frame (or frames) and get the vision model's prediction.\n",
    "2. **Audio**: Read the audio file, extract features, get the audio model's prediction.\n",
    "3. **Text**: Take the input transcript text and get the text model's prediction.\n",
    "4. **Fusion**: Combine the predictions from all available modalities.\n",
    "5. **Agent Decision**: Let the ReAct agent analyze the combined evidence and make a final decision (with a reasoning trace).\n",
    "6. **Human Verification** (optional): Allow a human to approve or override the decision.\n",
    "7. **Real-Time Considerations**: (Discussion) how to extend this to real-time analysis.\n",
    "\n",
    "Let's go through these steps."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "import cv2\n",
    "\n",
    "vision_pred = None\n",
    "if 'video_path' in locals() and video_path:\n",
    " cap = cv2.VideoCapture(video_path)\n",
    " success, frame = cap.read()\n",
    " cap.release()\n",
    " if success:\n",
    " # Preprocess the frame for the model\n",
    " frame_resized = cv2.resize(frame, (64, 64))\n",
    " frame_rgb = cv2.cvtColor(frame_resized, cv2.COLOR_BGR2RGB)\n",
    " frame_tensor = torch.from_numpy(frame_rgb).permute(2, 0, 1).unsqueeze(0).float() / 255.0\n",
    " with torch.no_grad():\n",
    " vision_out = vision_model(frame_tensor)\n",
    " vision_pred = vision_out.squeeze().tolist()\n",
    " print(f\"Vision model prediction (Truth,Lie): {vision_pred}\")\n",
    " else:\n",
    " print(\"Failed to read video frame. Vision model will be skipped.\")\n",
    "else:\n",
    " print(\"No video provided. Skipping vision analysis.\")"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "import librosa\n",
    "\n",
    "audio_pred = None\n",
    "if 'audio_path' in locals() and audio_path:\n",
    " y, sr = librosa.load(audio_path, sr=None, mono=True, duration=10)\n",
    " if y is not None:\n",
    " mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)\n",
    " mfcc = mfcc.T # shape (time_steps, 13)\n",
    " mfcc_tensor = torch.from_numpy(mfcc).unsqueeze(0).float()\n",
    " with torch.no_grad():\n",
    " audio_out = audio_model(mfcc_tensor)\n",
    " audio_pred = audio_out.squeeze().tolist()\n",
    " print(f\"Audio model prediction (Truth,Lie): {audio_pred}\")\n",
    " else:\n",
    " print(\"Could not load audio or audio is empty. Skipping audio analysis.\")\n",
    "else:\n",
    " print(\"No audio provided. Skipping audio analysis.\")"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Text analysis and Fusion\n",
    "text_pred = None\n",
    "if 'transcript' in locals() and transcript is not None:\n",
    " with torch.no_grad():\n",
    " text_out = text_model.predict(transcript)\n",
    " text_pred = text_out.squeeze().tolist()\n",
    " print(f\"Text model prediction (Truth,Lie): {text_pred}\")\n",
    "else:\n",
    " print(\"No text transcript provided. Skipping text analysis.\")\n",
    "\n",
    "# Combine available modality predictions\n",
    "available_preds = []\n",
    "if vision_pred is not None:\n",
    " available_preds.append(vision_pred)\n",
    "if audio_pred is not None:\n",
    " available_preds.append(audio_pred)\n",
    "if text_pred is not None:\n",
    " available_preds.append(text_pred)\n",
    "\n",
    "if available_preds:\n",
    " fused_pred = fuse_predictions(available_preds)\n",
    " print(\"Fused prediction (Truth,Lie):\", fused_pred)\n",
    "else:\n",
    " fused_pred = [0.5, 0.5]\n",
    " print(\"No modalities available to fuse. Defaulting to [0.5, 0.5].\")"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Agent decision\n",
    "final_decision, reasoning_trace = agent.analyze(\n",
    " vision_pred if vision_pred is not None else [1,0],\n",
    " audio_pred if audio_pred is not None else [1,0],\n",
    " text_pred if text_pred is not None else [1,0]\n",
    ")\n",
    "print(\"\\nAgent's reasoning trace:\")\n",
    "for line in reasoning_trace:\n",
    " print(\"*\", line)\n",
    "print(\"Agent's preliminary decision:\", final_decision)\n",
    "\n",
    "# Human-in-the-loop: ask user to approve or override\n",
    "user_feedback = input(\"Do you agree with this decision? (yes/no) \")\n",
    "if user_feedback.strip().lower() in [\"no\", \"n\"]:\n",
    " correct_label = input(\"Please enter the correct label ('Truth' or 'Lie'): \")\n",
    " print(f\"Human override: The correct label is '{correct_label}'.\")\n",
    " final_label = correct_label\n",
    "else:\n",
    " final_label = final_decision\n",
    " print(\"Decision accepted by human.\")\n",
    "\n",
    "print(\"\\nFinal decision (after human verification):\", final_label)"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "**Real-Time Use**: The above pipeline processes one batch of inputs. For real-time deception detection (e.g., during a live interview), you would continuously capture data and feed it to the models in a loop. For example:\n",
    "- Use a webcam feed to get frames and run the vision model on each (or every Nth) frame.\n",
    "- Stream audio input through the audio model in chunks.\n",
    "- Continuously update the transcript (if doing real-time speech-to-text) and analyze text segments.\n",
    "\n",
    "Such streaming implementation would require optimizing the models for speed and perhaps using asynchronous processing. However, the core steps remain similar to what we ran above. Additionally, the system should log each interaction (inputs, model outputs, reasoning) for audit and improvement​:contentReference[oaicite:27]{index=27}."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 6. Testing & Evaluation\n",
    "\n",
    "Building confidence in the system requires thorough testing and evaluation. We should test each component in isolation (unit tests) and the system as a whole (integration tests), and evaluate performance on collected data."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Unit Tests for Components\n",
    "We can write simple tests to ensure each model behaves as expected. For example, check that the VisionModel returns a probability tensor of the correct shape for a given image, or that the agent returns a decision string."
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "# Unit testing each component (simple examples)\n",
    "# Test VisionModel output shape\n",
    "test_img = torch.randn(1, 3, 64, 64)\n",
    "assert vision_model(test_img).shape == (1, 2)\n",
    "print(\"VisionModel unit test passed (output shape is 1x2).\")\n",
    "\n",
    "# Test AudioModel output shape\n",
    "test_audio = torch.randn(1, 10, 13) # 10 time steps of MFCC\n",
    "assert audio_model(test_audio).shape == (1, 2)\n",
    "print(\"AudioModel unit test passed (output shape is 1x2).\")\n",
    "\n",
    "# Test TextModel output type\n",
    "test_text_out = text_model.predict(\"This is a test.\")\n",
    "assert isinstance(test_text_out, torch.Tensor) and test_text_out.shape == (1, 2)\n",
    "print(\"TextModel unit test passed (output shape is 1x2).\")\n",
    "\n",
    "# Test Agent decision output\n",
    "dec, trace = agent.analyze([1,0], [1,0], [1,0]) # all modalities saying 'Truth'\n",
    "assert dec.startswith(\"Truth\")\n",
    "print(\"Agent unit test passed (agent returns a decision string).\")"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Performance Evaluation\n",
    "With a real dataset of labeled truthful and deceptive instances, we would train the models and then evaluate metrics like accuracy, precision, recall, and AUC (area under the ROC curve).\n",
    "\n",
    "For example, if we had arrays of true labels and predicted labels for a test set, we could do:"
    ]
    },
    {
    "cell_type": "code",
    "metadata": {},
    "execution_count": null,
    "outputs": [],
    "source": [
    "import numpy as np\n",
    "from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score\n",
    "\n",
    "# Example dummy data for demonstration\n",
    "y_true = np.array([0, 0, 1, 1]) # 0=Truth, 1=Lie (ground truth)\n",
    "y_pred = np.array([0, 1, 0, 1]) # model predictions\n",
    "y_scores = np.array([0.1, 0.9, 0.4, 0.8]) # predicted probability of 'Lie' for each instance\n",
    "\n",
    "print(\"Confusion Matrix:\\n\", confusion_matrix(y_true, y_pred))\n",
    "print(\"\\nClassification Report:\\n\", classification_report(y_true, y_pred, target_names=[\"Truth\",\"Lie\"]))\n",
    "print(\"AUC (ROC): %.2f\" % roc_auc_score(y_true, y_scores))"
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "### Bias and Fairness\n",
    "It's crucial to assess the model's performance across different demographic or situational subsets to ensure fairness​:contentReference[oaicite:28]{index=28}. For example, we should check if the system is equally accurate for people of different genders, ethnicities, dialects, etc. If we notice performance gaps, techniques like re-balancing the training data or algorithmic fairness adjustments (e.g., using IBM's AIF360 toolkit) can help​:contentReference[oaicite:29]{index=29}.\n",
    "\n",
    "We also test the system's robustness:\n",
    "- Try intentionally noisy or low-quality inputs (blurry video, loud background noise in audio) to see if the system still performs reasonably​:contentReference[oaicite:30]{index=30}.\n",
    "- Ensure that the system fails gracefully (perhaps by increasing uncertainty) rather than giving confident false outputs when data is poor.\n",
    "\n",
    "By conducting these tests, we aim to catch issues like overfitting, bias, or instability early and address them before deployment."
    ]
    },
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "## 7. Ethical Considerations & Responsible AI Use\n",
    "\n",
    "Implementing a lie detection system raises serious ethical and legal questions. We must address these to use the technology responsibly:\n",
    "\n",
    "- **Accuracy and Consequences**: No lie detector is 100% accurate. False positives (labeling truthful people as liars) can cause unjust harm, and false negatives (missing a lie) can be security risks​:contentReference[oaicite:31]{index=31}. Thus, our system provides confidence scores and flags uncertain cases rather than making absolute judgments​:contentReference[oaicite:32]{index=32}. A human should always double-check important decisions.\n",
    "\n",
    "- **Bias and Fairness**: AI models can inadvertently be biased. If the training data isn't diverse, the system might be less accurate for certain groups (e.g., due to differences in facial expressions or speech patterns across cultures). We must strive to train on diverse data and test for bias. As one EU politician noted regarding AI lie detectors: *\"It will discriminate against anyone who is disabled or who has an anxious personality. It will not work.\"*​:contentReference[oaicite:33]{index=33}. We must be vigilant that our system does not unfairly target certain traits or communities.\n",
    "\n",
    "- **Privacy**: By nature, this system analyzes personal and biometric data (faces, voices, physiological signals). Under privacy laws like GDPR, such data is highly sensitive​:contentReference[oaicite:34]{index=34}. We should obtain informed consent from subjects, ensure data is securely stored or processed locally, and allow individuals to opt-out. Only the necessary data for the analysis should be collected, and it should be deleted after use unless explicitly consented for storage.\n",
    "\n",
    "- **Legal Compliance**: In some jurisdictions, using AI for lie detection (especially in law enforcement or hiring) could be regulated or even prohibited. The upcoming EU AI Act, for example, classifies \"emotion recognition\" systems as high-risk​:contentReference[oaicite:35]{index=35}. Deployers must ensure they follow all relevant laws and regulations. Also, this system should complement human judgment, not replace it​:contentReference[oaicite:36]{index=36}. For critical decisions (like criminal investigations), AI output should not be the sole evidence.\n",
    "\n",
    "- **Pseudoscience and Limitations**: The scientific community is still debating how effective AI is at detecting deception. Some critics call these systems \"pseudoscience\" if claimed to be foolproof​:contentReference[oaicite:37]{index=37}. We acknowledge that this tool has limitations and should not be considered a magical truth machine. It's an assistive tool that highlights potential signs of deceit, which a human expert must interpret with caution​:contentReference[oaicite:38]{index=38}. Transparency about the system's accuracy and caveats is essential.\n",
    "\n",
    "- **Ethical Use Policies**: Anyone deploying such a system should have clear policies: when it is appropriate to use (and when not), who has access to the results, and how to ensure accountability. Logs of the agent's reasoning and human interventions should be kept (for example, to audit decisions)​:contentReference[oaicite:39]{index=39}. Users of the system should be trained in understanding its outputs and the uncertainty involved. Ultimately, the goal is to aid truth-finding, not to unfairly accuse innocent people or violate privacy.\n",
    "\n",
    "By considering these factors, we aim to develop and deploy the lie detection system in a way that is **fair, transparent, and accountable**. Responsible AI use isn't just a final step – it's a continuous process of monitoring and improving the system in the real world."
    ]
    }
    ]
    }
  3. @ruvnet ruvnet revised this gist Feb 8, 2025. 1 changed file with 0 additions and 2 deletions.
    2 changes: 0 additions & 2 deletions Liar-Ai.md
  4. @ruvnet ruvnet revised this gist Feb 8, 2025. 1 changed file with 20 additions and 0 deletions.
    20 changes: 20 additions & 0 deletions Liar-Ai.md
    @@ -7,6 +7,26 @@

    ## WTF? The world's most powerful lie detector.

    🤯 Zoom calls will never be the same. I think I might have just created the world’s most powerful lie detector tutorial using deep research.

    This isn’t just another AI gimmick; it’s a multi-modal deception detection system that leverages neurosymbolic AI, recursive reasoning, and reinforcement learning to analyze facial expressions, vocal stress, and linguistic cues in real time. I used OpenAI Deep Research to build it, and it appears to work (tested on the nightly news).

    I built this for good.

    AI can be used for a lot of things, and not all of them are great. So I asked myself: What if I could level the playing field?

    What if the most advanced lie detection technology wasn’t locked away in government labs or corporate surveillance tools, but instead available to everyone? With the right balance of transparency, explainability, and human oversight, this system can be a powerful tool for truth-seeking—whether in negotiations, investigations, or just cutting through deception in everyday conversations.

    This system isn’t just a classifier—it’s an agentic reasoning system, built on ReAct (Reasoning + Acting) and recursive decision-making models. It doesn’t just detect deception; it thinks through its own process, iteratively refining its conclusions based on multi-modal evidence.

    It applies reinforcement learning strategies to improve its judgment over time and neurosymbolic logic to merge deep learning’s pattern recognition with structured rule-based inference.

    Recursive uncertainty estimation (yes, really) ensures that when modalities (audio, visual, or sensory) disagree or confidence is low, the system adapts, either requesting additional data, consulting prior knowledge, or deferring to human oversight. This makes it far more than just a deep learning model; it’s an adaptive reasoning engine for deception analysis.

    But with great power comes great responsibility. This tool reveals the truth, but how you use it is up to you.

    See: https://lnkd.in/gAVVhqQU

    ---
  5. @ruvnet ruvnet created this gist Feb 8, 2025.
    509 changes: 509 additions & 0 deletions Liar-Ai.md
    @@ -0,0 +1,509 @@
    # Multi-Modal Lie Detection System using an Agentic ReAct Approach: Step-by-Step Tutorial

    **Author:** rUv
    **Created by:** rUv, cause he could

    ---

    ## WTF? The world's most powerful lie detector.

    This tutorial presents a comprehensive, PhD-level guide to building a multi-modal lie detection system that leverages an agentic approach with ReAct (Reasoning + Acting). The system integrates best-in-class AI techniques to process multiple sensory inputs (vision, audio, text, and physiological signals) and uses a human-in-the-loop framework for decision management and continuous improvement. Designed for researchers and advanced practitioners, this document details the architecture, technical implementation, and ethical considerations needed to create a responsible and interpretable deception detection system.

    ---

    ## Introduction

    Detecting deception has long been a challenge in fields such as security, law enforcement, and psychology. Traditional methods like the polygraph are controversial and error-prone, as even experienced human observers often struggle with the subtle cues of lying. Modern AI-driven approaches aim to overcome these limitations by combining multiple modalities—such as facial expressions, vocal stress, linguistic cues, and physiological signals—to build a more accurate picture of a subject’s truthfulness. This tutorial demonstrates how to construct a multi-modal lie detection system that not only fuses diverse sensory data but also employs an agentic ReAct framework to generate interpretable reasoning traces and decisions. By integrating human oversight, the system supports ethical, privacy-aware, and accountable decision-making.

    ---

    ## Table of Contents

    1. [Introduction](#introduction)
    2. [Features](#features)
    3. [Architecture](#architecture)
       - 3.1 Modality-Specific Analysis Pipelines
       - 3.2 Feature Fusion Layer
       - 3.3 Agent-Based Reasoning (ReAct Agent)
       - 3.4 Neuro-Symbolic Reasoning Module
       - 3.5 Database and Logging (Knowledge Base)
    4. [Technical Details](#technical-details)
    5. [Complete Code](#complete-code)
       - 5.1 Setting Up the Project and Dependencies
       - 5.2 Project File/Folder Structure
       - 5.3 Implementing the Vision Model (Facial Analysis)
       - 5.4 Implementing the Audio Model (Speech Analysis)
       - 5.5 Implementing the Text Model (NLP Analysis)
       - 5.6 (Optional) Fusion Model
       - 5.7 Implementing the Agent with ReAct Reasoning
       - 5.8 Main Script and CLI Interface
    6. [Human-in-the-Loop Integration](#human-in-the-loop-integration)
    7. [Testing and Evaluation](#testing-and-evaluation)
    8. [Ethical Considerations](#ethical-considerations)
    9. [References](#references)

    ---

    ## 1. Introduction

    Detecting deception—determining if someone is lying—is a longstanding challenge in areas such as security and law enforcement. Traditional methods (e.g., the polygraph) rely on measuring physiological responses like heart rate and perspiration but have a well-documented history of unreliability. Human observers, even trained professionals, can find it difficult to accurately identify lies because deceptive cues are often subtle and varied.

    Modern AI-driven deception detection aims to address these limitations by analyzing multiple data sources simultaneously. Beyond physiological signals, systems now examine visual, auditory, and linguistic cues. By integrating these modalities, the approach captures a richer picture of behavior than any single channel can provide. Studies have shown that multi-modal analysis improves performance in lie detection. For example, by fusing visual, auditory, and textual data, researchers have achieved significant accuracy gains compared to using any single modality alone.

    In addition, there is growing emphasis on transparency. High-accuracy models must also offer explainability to justify decisions, especially in sensitive applications. Techniques such as attention visualization, feature importance scoring, and the ReAct reasoning framework help demystify how the system reaches its conclusions.

    This tutorial guides you through designing and implementing a multi-modal lie detection system that leverages advanced deep learning models, sensor fusion, and an agent-based reasoning process. Human oversight is integrated to ensure that the system remains interpretable, accountable, and ethically sound.

    ---

    ## 2. Features

    Our multi-modal lie detection system offers the following key features:

    - **Multi-Modal Data Fusion:**
    Processes and fuses information from diverse sources—facial video (for micro-expressions and gaze), audio (for voice stress and tone), textual transcripts (for linguistic cues), and, when available, physiological sensor data (e.g., heart rate, skin conductance). This approach captures a comprehensive range of deception indicators.

    - **Explainability & Interpretability:**
    Provides human-understandable explanations for its decisions by highlighting influential cues (e.g., elevated voice pitch or incongruent facial expressions). Techniques such as attention visualization and feature importance scoring (using methods like LIME/SHAP) make the inner workings transparent.

    - **Real-Time and Batch Processing:**
    Supports both real-time streaming analysis and offline batch processing, allowing instantaneous assessments during an interview or post-analysis of recorded sessions.

    - **Human-in-the-Loop Oversight:**
    Integrates human expertise into the decision-making process. Experts can review, validate, or override AI decisions, and their feedback is used to continuously improve the model.

    - **Privacy-Preserving Architecture:**
    Designed with data protection in mind, the system processes sensitive biometric data in a privacy-aware manner. Techniques such as on-device processing, federated learning, and data anonymization help it comply with privacy regulations.
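
As one illustration of the anonymization idea, the sketch below replaces a subject identifier with a keyed hash before anything is written to a log. The function name `pseudonymize` and the key handling are assumptions made for this example, not part of the tutorial's code. Keyed hashing is pseudonymization rather than full anonymization, so the key must be stored separately from the data.

```python
import hashlib
import hmac


def pseudonymize(subject_id, secret_key):
    """Return a stable, non-reversible token for a subject identifier.

    secret_key is a bytes value kept outside the log (hypothetical setup).
    """
    digest = hmac.new(secret_key, subject_id.encode("utf-8"), hashlib.sha256)
    # A 16-character prefix is enough to tell subjects apart in a small study.
    return digest.hexdigest()[:16]


key = b"replace-with-a-real-secret"
token = pseudonymize("subject-042", key)
print(token)
```

The same subject always maps to the same token, so sessions can still be linked for analysis without storing the raw identifier.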

    ---

    ## 3. Architecture

    The system architecture is modular and pipeline-based, with the following main components:

    ### 3.1 Modality-Specific Analysis Pipelines

    - **Vision Pipeline:**
    Processes video or images using computer vision techniques. A deep Convolutional Neural Network (CNN) analyzes facial expressions, micro-expressions, gaze, and body language to produce features indicative of deception.

    - **Audio Pipeline:**
    Analyzes speech using a deep learning model (e.g., LSTM, 1D-CNN, or pre-trained transformer like Wav2Vec 2.0) to extract acoustic features such as pitch, jitter, and speech rate that may signal stress or deception.

    - **Text/NLP Pipeline:**
    Evaluates linguistic cues in transcripts using a Transformer-based classifier (such as BERT or RoBERTa) to identify language patterns associated with deceptive speech.

    - **Physiological Pipeline:**
    When available, processes sensor data (e.g., heart rate, skin conductance) to detect anomalies associated with deception.
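
The physiological pipeline is not implemented in the code later in this tutorial. As a minimal sketch of what "detecting anomalies" could mean, the example below compares a heart-rate reading with a per-subject baseline using a z-score. The function name, the threshold of 2.0, and the sample values are assumptions for illustration only.

```python
from statistics import mean, stdev


def heart_rate_anomaly(baseline_bpm, current_bpm, z_threshold=2.0):
    """Flag a reading that deviates strongly from the subject's own baseline."""
    mu = mean(baseline_bpm)
    sigma = stdev(baseline_bpm)  # sample standard deviation; needs >= 2 readings
    if sigma == 0:
        raise ValueError("baseline has no variation; cannot compute a z-score")
    z = (current_bpm - mu) / sigma
    return {"z_score": z, "anomalous": abs(z) > z_threshold}


baseline = [70, 72, 68, 71, 69]  # resting readings (illustrative values)
print(heart_rate_anomaly(baseline, 80))
print(heart_rate_anomaly(baseline, 71))
```

A real pipeline would also need to account for movement, baseline drift, and sensor noise, which this sketch ignores.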

    ### 3.2 Feature Fusion Layer

    The fusion layer combines the outputs of each modality. Options include:

    - **Early Fusion:** Combining raw features into one vector.
    - **Late Fusion:** Independently generating deception scores for each modality and merging them (e.g., via a weighted average or meta-classifier).
    - **Hybrid Fusion:** Employing an attention mechanism to dynamically weigh modalities.

    Our implementation demonstrates a late fusion approach for simplicity while allowing for future extension.
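
Late fusion by weighted average can be sketched in a few lines. The function `late_fusion` and the example weights are hypothetical. The agent in section 5.7 uses the equal-weight case, a plain mean over the available modalities.

```python
def late_fusion(scores, weights=None):
    """Combine per-modality deception probabilities into one score.

    scores:  dict such as {"vision": 0.8, "audio": 0.6}
    weights: optional dict with the same keys; defaults to equal weights
    """
    if not scores:
        raise ValueError("at least one modality score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total


scores = {"vision": 0.8, "audio": 0.6, "text": 0.4}
print(late_fusion(scores))                                              # equal weights
print(late_fusion(scores, {"vision": 2.0, "audio": 1.0, "text": 1.0}))  # trust vision more
```

Because only the modalities present in `scores` are used, a missing input simply drops out of the average.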

    ### 3.3 Agent-Based Reasoning (ReAct Agent)

    At the core is an intelligent agent that employs the ReAct paradigm. It iteratively generates internal reasoning traces (e.g., “Facial cues suggest stress, but vocal analysis is moderate”) and takes actions such as querying additional data or flagging ambiguous cases for human review. This interleaved reasoning and acting process produces a final decision with an interpretable explanation.

    ### 3.4 Neuro-Symbolic Reasoning Module

    This module integrates neural network outputs with symbolic rules to enforce domain knowledge. For instance, a rule might state: “If text content contradicts facial emotion, increase the deception probability.” This neuro-symbolic approach enhances robustness and interpretability.
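
The example rule above could be expressed as follows. This is a sketch under stated assumptions: the label vocabulary (`positive`, `negative`, `neutral`), the function name, and the size of the adjustment are invented for illustration, and the module is not part of the code in section 5.

```python
def apply_contradiction_rule(prob, text_sentiment, facial_emotion, boost=0.15):
    """Raise the deception probability when words and face disagree.

    Both labels are assumed to be 'positive', 'negative' or 'neutral'.
    Returns the adjusted probability and a short trace message.
    """
    contradictory = {("positive", "negative"), ("negative", "positive")}
    if (text_sentiment, facial_emotion) in contradictory:
        adjusted = min(1.0, prob + boost)  # keep the result a valid probability
        return adjusted, "Rule fired: text sentiment contradicts facial emotion."
    return prob, "Rule not fired."


print(apply_contradiction_rule(0.50, "positive", "negative"))
print(apply_contradiction_rule(0.50, "positive", "neutral"))
```

Returning the trace message alongside the score lets the rule contribute to the agent's explanation as well as its decision.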

    ### 3.5 Database and Logging (Knowledge Base)

    A persistent storage component logs:
    - Inputs and extracted features,
    - The agent’s reasoning trace and final decision,
    - Human feedback (when available).

    This log serves both as a knowledge base for context-aware decisions and as an audit trail for compliance and continuous model improvement.
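
A lightweight way to get such an audit trail is an append-only JSON Lines file, which is also what `main.py` writes in section 5.8. The helpers below are a sketch. The field names and the `human_feedback` argument are assumptions, and a production system would more likely use a database with access controls.

```python
import json
import time


def log_decision(path, inputs, result, human_feedback=None):
    """Append one analysis record to a JSON Lines file and return it."""
    record = {
        "timestamp": time.time(),
        "inputs": inputs,                  # e.g. file names, never raw biometrics
        "decision": result["decision"],
        "confidence": result["confidence"],
        "explanation": result["explanation"],
        "human_feedback": human_feedback,  # filled in after expert review
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def read_log(path):
    """Load every record back, one dict per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each line is an independent JSON object, so records can be appended safely and replayed later for audits or retraining.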

    ---

    ## 4. Technical Details

    Key technical considerations include:

    - **Deep Learning for Each Modality:**
    Each modality uses state-of-the-art models. For facial analysis, a CNN (or a pre-trained network such as ResNet50) is fine-tuned on emotion datasets. For audio, pre-trained models like Wav2Vec 2.0 provide rich representations. For text, Transformer-based models (BERT, RoBERTa) are fine-tuned on deception-related data.

    - **Sensor Fusion Techniques:**
    We implement late fusion by combining independent deception scores from each modality. Future extensions could employ attention-based fusion networks.

    - **Reinforcement Learning for Agent Decisions:**
    While the agent currently uses rule-based reasoning, it can be extended with reinforcement learning (using frameworks such as Gymnasium, the maintained successor to OpenAI Gym, and Stable-Baselines3) to optimize decision-making over time.

    - **Model Uncertainty Estimation:**
    Techniques like Monte Carlo dropout and ensemble methods provide confidence scores, allowing the agent to flag uncertain decisions.

    - **Explainable AI (XAI) Techniques:**
    Methods such as Grad-CAM for vision, SHAP/LIME for audio and text, and a detailed reasoning trace from the ReAct agent ensure that every decision is accompanied by a human-understandable explanation.
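
To make the uncertainty idea concrete, the sketch below summarizes several probability estimates for the same input. They could come from an ensemble of models or from repeated stochastic forward passes with dropout left on. How the estimates are produced is outside the example, and the function name and the `max_std` threshold are assumptions.

```python
from statistics import mean, pstdev


def ensemble_confidence(probabilities, max_std=0.15):
    """Summarize repeated deception estimates for one input.

    A large spread between estimates is treated as a sign that the
    prediction should be flagged rather than trusted.
    """
    avg = mean(probabilities)
    spread = pstdev(probabilities)  # population standard deviation
    return {"probability": avg, "std": spread, "uncertain": spread > max_std}


print(ensemble_confidence([0.72, 0.70, 0.75, 0.69]))  # estimates agree
print(ensemble_confidence([0.10, 0.90, 0.45, 0.80]))  # estimates disagree
```

The `uncertain` flag could feed the same human-review path the agent uses when modalities disagree.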

    ---

    ## 5. Complete Code

    Below is a reference implementation, organized into modules. The models are untrained skeletons, and training and real-time capture are left as stubs.

    ### 5.1 Setting Up the Project and Dependencies

    Use **Poetry** for dependency management. In your `pyproject.toml`, include:

    ```toml
    [tool.poetry.dependencies]
    python = ">=3.8,<3.12"
    torch = ">=2.0.0"
    torchvision = ">=0.15.0"
    transformers = ">=4.0.0"
    opencv-python = ">=4.5.0"
    librosa = ">=0.9.0"
    numpy = ">=1.20.0"
    ```

    ### 5.2 Project File/Folder Structure

    ```
    lie_detector/
    ├── data/                       # Data files (e.g., sample videos, audio clips, transcripts)
    ├── models/                     # Deep learning models for each modality
    │   ├── vision_model.py         # Facial image analysis model
    │   ├── audio_model.py          # Audio analysis model
    │   ├── text_model.py           # NLP analysis model
    │   └── fusion_model.py         # (Optional) Multi-modal fusion model
    ├── agents/
    │   └── lie_detect_agent.py     # Agent implementing ReAct reasoning and decision logic
    ├── utils/                      # Utility modules (data loading, preprocessing, explainability)
    │   ├── data_loader.py
    │   ├── preprocess.py
    │   └── xai.py
    ├── main.py                     # Main CLI script for training, evaluation, and real-time inference
    └── tests/                      # Test scripts (unit and integration tests)
        ├── test_models.py
        └── test_agent.py
    ```

    ### 5.3 Implementing the Vision Model (Facial Analysis)

    ```python
    # models/vision_model.py
    import torch
    import torch.nn as nn
    import torchvision.transforms as T

    class VisionModel(nn.Module):
        """Small CNN that maps a face image to a single deception logit."""

        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
            self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=2)
            self.fc1 = nn.Linear(32 * 6 * 6, 100)
            self.fc2 = nn.Linear(100, 1)
            self.relu = nn.ReLU()
            # Adaptive pooling fixes the spatial size at 6x6, so fc1 always sees 32*6*6 features.
            self.pool = nn.AdaptiveAvgPool2d((6, 6))
            # Preprocessing for a PIL image or HxWxC uint8 array:
            # convert to tensor, resize to 48x48, scale values to [-1, 1].
            self.transform = T.Compose([
                T.ToTensor(),
                T.Resize((48, 48)),
                T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
            ])

        def forward(self, x):
            x = self.relu(self.conv1(x))
            x = self.pool(x)
            x = self.relu(self.conv2(x))
            x = self.pool(x)
            x = x.view(x.size(0), -1)  # flatten to (batch, 32*6*6)
            x = self.relu(self.fc1(x))
            score = self.fc2(x)        # raw logit; sigmoid is applied in predict_deception
            return score

        def predict_deception(self, image):
            """Return the deception probability for one RGB image (PIL image or array, not a path)."""
            self.eval()
            with torch.no_grad():
                img_tensor = self.transform(image).unsqueeze(0)  # add batch dimension
                score = self.forward(img_tensor)
                prob = torch.sigmoid(score).item()
            return prob
    ```

    ### 5.4 Implementing the Audio Model (Speech Analysis)

    ```python
    # models/audio_model.py
    import librosa
    import torch
    import torch.nn as nn

    class AudioModel(nn.Module):
        """Two-layer MLP over time-averaged MFCC features."""

        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(20, 32)  # 20 inputs = 20 MFCC coefficients
            self.fc2 = nn.Linear(32, 1)
            self.relu = nn.ReLU()

        def forward(self, feats):
            x = self.relu(self.fc1(feats))
            score = self.fc2(x)  # raw logit
            return score

        def extract_features(self, audio_path):
            # Load at most the first 5 seconds at the file's native sample rate.
            y, sr = librosa.load(audio_path, sr=None, duration=5.0)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape: (20, frames)
            mfcc_mean = mfcc.mean(axis=1)                       # average over time -> (20,)
            return mfcc_mean

        def predict_deception(self, audio_path):
            """Return the deception probability for one audio file."""
            self.eval()
            mfcc_feat = self.extract_features(audio_path)
            mfcc_tensor = torch.from_numpy(mfcc_feat).float().unsqueeze(0)
            with torch.no_grad():
                score = self.forward(mfcc_tensor)
                prob = torch.sigmoid(score).item()
            return prob
    ```

    ### 5.5 Implementing the Text Model (NLP Analysis)

    ```python
    # models/text_model.py
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    class TextModel:
        """Transformer sequence classifier; label index 1 is treated as 'deceptive'."""

        def __init__(self, model_name="bert-base-uncased", num_labels=2):
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            # The classification head of a base checkpoint is randomly initialized,
            # so predictions are not meaningful until the model is fine-tuned.
            self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

        def predict_deception(self, text):
            """Return the deception probability for one text string."""
            self.model.eval()
            inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True)
            with torch.no_grad():
                outputs = self.model(**inputs)
                logits = outputs.logits
                probs = torch.softmax(logits, dim=1)[0].cpu().numpy()
            deception_prob = float(probs[1])
            return deception_prob
    ```

    ### 5.6 (Optional) Fusion Model

    ```python
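    # models/fusion_model.py
    import torch
    import torch.nn as nn

    class FusionModel(nn.Module):
        """Learned late fusion: one linear layer over the three modality scores."""

        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(3, 1)  # inputs: [vision, audio, text] probabilities

        def forward(self, x):
            return self.fc(x)  # raw logit; apply sigmoid to get a probability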
    ```

    ### 5.7 Implementing the Agent with ReAct Reasoning

    ```python
    # agents/lie_detect_agent.py
    from models.vision_model import VisionModel
    from models.audio_model import AudioModel
    from models.text_model import TextModel

    class LieDetectAgent:
        """Rule-based agent that records a reasoning trace while fusing modality scores."""

        def __init__(self):
            self.vision_model = VisionModel()
            self.audio_model = AudioModel()
            self.text_model = TextModel()
            self.thoughts = []

        def analyze(self, image=None, audio_file=None, text=None):
            self.thoughts = []  # reset the trace for each analysis
            scores = {}

            if image is not None:
                vision_prob = self.vision_model.predict_deception(image)
                scores['vision'] = vision_prob
                self.thoughts.append(f"Vision analysis: model returned probability {vision_prob:.2f} for deception.")
                if vision_prob > 0.7:
                    self.thoughts.append("Thought: Facial cues (micro-expressions) suggest stress or deceit.")
                elif vision_prob < 0.3:
                    self.thoughts.append("Thought: Facial expression appears normal/relaxed.")

            if audio_file is not None:
                audio_prob = self.audio_model.predict_deception(audio_file)
                scores['audio'] = audio_prob
                self.thoughts.append(f"Audio analysis: model returned probability {audio_prob:.2f} for deception.")
                if audio_prob > 0.7:
                    self.thoughts.append("Thought: Voice features (pitch/tone) indicate high stress.")
                elif audio_prob < 0.3:
                    self.thoughts.append("Thought: Voice does not show significant stress indicators.")

            if text is not None:
                text_prob = self.text_model.predict_deception(text)
                scores['text'] = text_prob
                self.thoughts.append(f"Text analysis: model returned probability {text_prob:.2f} for deception.")
                if text_prob > 0.7:
                    self.thoughts.append("Thought: Linguistic analysis finds cues of deception in wording.")
                elif text_prob < 0.3:
                    self.thoughts.append("Thought: Linguistic content appears consistent (no obvious deception cues).")

            if not scores:
                return {"decision": "No data", "confidence": 0.0, "explanation": "No input provided."}

            # Late fusion: unweighted mean of the available modality scores.
            avg_score = sum(scores.values()) / len(scores)
            self.thoughts.append(f"Fused probability (average) = {avg_score:.2f}.")

            if avg_score >= 0.5:
                decision = "Deceptive"
                conf = avg_score
            else:
                decision = "Truthful"
                conf = 1 - avg_score
            self.thoughts.append(f"Action: Based on combined score, decision = {decision}.")

            # Borderline score plus strong disagreement between modalities -> ask a human.
            if 0.4 < avg_score < 0.6 and len(scores) > 1:
                spread = max(scores.values()) - min(scores.values())
                if spread > 0.5:
                    self.thoughts.append("Thought: Modalities disagree significantly. Flagging for human review.")
                    decision = decision + " (needs human review)"

            explanation = " ; ".join(self.thoughts)
            return {"decision": decision, "confidence": float(conf), "explanation": explanation, "scores": scores}
    ```

    ### 5.8 Main Script and CLI Interface

    ```python
    # main.py
    import argparse
    import json
    from PIL import Image  # Pillow is installed as a torchvision dependency
    from agents.lie_detect_agent import LieDetectAgent

    def run_realtime(agent):
        print("Starting real-time lie detection. Press Ctrl+C to stop.")
        try:
            while True:
                # Placeholder: webcam/microphone capture would feed agent.analyze() here.
                print("Real-time capture not implemented in this demo.")
                break
        except KeyboardInterrupt:
            print("Stopping real-time detection.")

    def main():
        parser = argparse.ArgumentParser(description="Multi-modal Lie Detection System")
        subparsers = parser.add_subparsers(dest="command", required=True)

        train_parser = subparsers.add_parser("train", help="Train the models on a dataset (not implemented fully).")
        train_parser.add_argument("--data-dir", type=str, help="Path to training data")

        eval_parser = subparsers.add_parser("eval", help="Evaluate the system on given inputs.")
        eval_parser.add_argument("--image", type=str, help="Path to image file of face")
        eval_parser.add_argument("--audio", type=str, help="Path to audio file")
        eval_parser.add_argument("--text", type=str, help="Text input (surround in quotes)")

        subparsers.add_parser("realtime", help="Run the system in real-time mode (webcam/mic)")

        args = parser.parse_args()

        if args.command == "train":
            print("Training mode selected. (Implement training loop to fit models on data).")

        elif args.command == "eval":
            agent = LieDetectAgent()
            # The vision model expects image data, not a path, so load the file first.
            image = Image.open(args.image).convert("RGB") if args.image else None
            result = agent.analyze(image=image, audio_file=args.audio, text=args.text)
            print(f"\nDecision: {result['decision']} (Confidence: {result['confidence']*100:.1f}%)")
            print(f"Explanation: {result['explanation']}")
            # Append the result to a JSON Lines audit log.
            with open("analysis_log.json", "a") as logf:
                logf.write(json.dumps(result) + "\n")

        elif args.command == "realtime":
            agent = LieDetectAgent()
            run_realtime(agent)

    if __name__ == "__main__":
        main()
    ```

    ---

    ## 6. Human-in-the-Loop Integration

    Our system is designed to work with human experts. Key integration points include:

    - **Flagging for Review:**
    If modalities produce contradictory results or if the decision is borderline, the system flags the case for human review. In the output, such cases are marked accordingly.

    - **Expert Dashboard:**
    A dedicated interface (web or desktop) can display the video with facial landmarks, audio waveforms, and transcript highlights alongside the AI’s explanation, enabling experts to approve or override decisions.

    - **Feedback Loop:**
    Human feedback is logged and can be used to retrain or fine-tune the models. This active learning process continuously improves the system.

    - **Interface for Validation:**
    In command-line mode, a prompt may request human validation of the AI’s decision. In a full deployment, this would be integrated into a more user-friendly graphical interface.
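
A command-line validation prompt of the kind described above might look like the sketch below. The function name `request_validation` and the added `human_verdict` and `human_label` fields are hypothetical. The `ask` parameter defaults to the built-in `input` and is replaced by a scripted answer here so the example runs without a terminal.

```python
def request_validation(result, ask=input):
    """Ask a reviewer to accept or override an AI decision.

    Returns a copy of the result with the reviewer's verdict attached.
    """
    prompt = (
        f"AI decision: {result['decision']} "
        f"(confidence {result['confidence']:.2f}). Accept? [y/n] "
    )
    answer = ask(prompt).strip().lower()
    reviewed = dict(result)  # leave the original record untouched
    if answer in ("y", "yes"):
        reviewed["human_verdict"] = "accepted"
    else:
        reviewed["human_verdict"] = "overridden"
        reviewed["human_label"] = ask("Correct label (Truthful/Deceptive): ").strip()
    return reviewed


example = {"decision": "Deceptive (needs human review)", "confidence": 0.55}
reviewed = request_validation(example, ask=lambda prompt: "y")  # scripted reviewer
print(reviewed["human_verdict"])
```

Storing the verdict next to the original decision gives the feedback loop labeled examples for later fine-tuning.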

    ---

    ## 7. Testing and Evaluation

    To ensure reliability, the system is subject to rigorous testing:

    - **Unit Tests:**
    Each component (e.g., VisionModel, AudioModel, TextModel) is tested for correct input/output behavior. For example, verifying that the VisionModel returns a probability in the expected range given a dummy input.

    - **Integration Tests:**
    The complete pipeline is tested on sample data to ensure that all components interact correctly. Tests also cover the CLI interface.

    - **Performance Evaluation:**
    The system is evaluated on benchmark datasets, measuring accuracy, precision, recall, F1, ROC curves, and confusion matrices. Special attention is given to false positives.

    - **Bias and Fairness Testing:**
    Performance is assessed across different demographic groups using fairness metrics. Toolkits such as AIF360 may be used to quantify and mitigate bias.

    - **Robustness Testing:**
    The system is tested on degraded or noisy inputs (e.g., low-light images, noisy audio) to ensure graceful handling of errors.
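
The performance and fairness metrics listed above can be computed from binary labels without extra dependencies. The sketch below is illustrative. The function names, the toy labels, and the group tags are assumptions, with 1 meaning "deceptive" and 0 meaning "truthful".

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 (0.0 when a denominator is zero)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def false_positive_rate_by_group(y_true, y_pred, groups):
    """False-positive rate per group: truthful people wrongly flagged as deceptive."""
    rates = {}
    for g in sorted(set(groups)):
        idx = [i for i, name in enumerate(groups) if name == g]
        _, fp, _, tn = confusion_counts([y_true[i] for i in idx], [y_pred[i] for i in idx])
        rates[g] = fp / (fp + tn) if fp + tn else 0.0
    return rates


y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "b", "b", "b"]
print(classification_metrics(y_true, y_pred))
print(false_positive_rate_by_group(y_true, y_pred, groups))
```

Comparing false-positive rates across groups is one simple fairness check. A large gap means some groups are wrongly flagged more often than others.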

    ---

    ## 8. Ethical Considerations

    Building an AI lie detection system raises important ethical issues that must be addressed:

    - **Accuracy and the Risk of Error:**
    No system is infallible. False positives (wrongly accusing someone of lying) and false negatives (missing deception) have serious consequences. The system is designed to provide probabilistic outputs and to flag uncertain cases for human review.

    - **Bias and Fairness:**
    Care is taken to ensure that training data is diverse and that the system’s performance is consistent across demographic groups. Bias detection and mitigation techniques are integrated to avoid discriminatory outcomes.

    - **Privacy:**
    Since the system processes sensitive biometric data (faces, voices, physiological signals), privacy is a top priority. Data is processed in a privacy-preserving manner (e.g., on-device processing, encryption, anonymization), and user consent is mandatory.

    - **Legality and Compliance:**
    Deployment in sensitive domains (e.g., law enforcement) requires strict adherence to legal standards and ethical guidelines. The system is designed to augment human decision-making rather than serve as the sole basis for critical decisions.

    - **Pseudoscience Concerns and Limitations:**
    Given ongoing debates about the reliability of lie detection, the system is presented as an assistive tool. Its outputs are not intended to be used as standalone evidence, and full disclosure of its limitations is required.

    - **Ethical Use Policies:**
    Clear policies must be established regarding when and how the system is used. Transparency, accountability, and the right for individuals to contest decisions are essential components of ethical deployment.

    ---

    ## 9. References

    1. [4] Details on the reliability issues of traditional polygraph tests.
    2. [9] Studies on multi-modal integration in deception detection.
    3. [22] Research on deception detection using visual, auditory, and textual data (including works by Sehrawat et al. and Gupta et al.).
    4. [6] The ReAct reasoning framework for agent-based systems.
    5. [17] Guidelines and techniques for privacy-preserving AI.
    6. [19] Research on facial micro-expression detection.
    7. [21] Developments in audio analysis, including the Wav2Vec 2.0 model.
    8. [10] Techniques for sensor fusion and decision-level (late) fusion.
    9. [14] Advances in neuro-symbolic reasoning in AI.
    10. [26] Evaluations and metrics for bias and fairness in AI.
    11. [25] Considerations regarding privacy and legal aspects of biometric data.
    12. [23] Critiques and limitations of AI lie detection systems.

    ---

    End of Tutorial.