<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:code-available-feed="tag:code-available-feed.github.io,2026:atom-extensions">
  <title>cs.SD strict=false code-available-feed/code-available-feed</title>
  <id>https://code-available-feed.github.io/code-available-feed/arxiv/cs.sd/atom.xml</id>
  <updated>2026-06-09T14:21:45Z</updated>
  <link rel="self" type="application/atom+xml" href="https://code-available-feed.github.io/code-available-feed/arxiv/cs.sd/atom.xml" />
  <link rel="alternate" type="text/html" href="https://github.com/code-available-feed/code-available-feed" />
  <entry>
    <title>[cs.SD] What Do Deepfake Speech Detectors Actually Hear?</title>
    <author>
      <name>Vojtěch Staněk</name>
    </author>
    <author>
      <name>Veronika Jirmusová</name>
    </author>
    <author>
      <name>Anton Firc</name>
    </author>
    <author>
      <name>Kamil Malinka</name>
    </author>
    <author>
      <name>Jakub Reš</name>
    </author>
    <author>
      <name>Martin Perešíni</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10912v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10912v1" />
    <published>2026-06-09T14:21:45Z</published>
    <updated>2026-06-09T14:21:45Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vojtěch Staněk et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Ethical and Technical Limits of Deepfake Speech Datasets</title>
    <author>
      <name>Vojtěch Staněk</name>
    </author>
    <author>
      <name>Eva Trnovská</name>
    </author>
    <author>
      <name>Kamil Malinka</name>
    </author>
    <author>
      <name>Anton Firc</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10911v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10911v1" />
    <published>2026-06-09T14:20:55Z</published>
    <updated>2026-06-09T14:20:55Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vojtěch Staněk et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, we identify substantial overlap in underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] RAT: Reference-Augmented Training for ASV Anti-Spoofing</title>
    <author>
      <name>Vojtěch Staněk</name>
    </author>
    <author>
      <name>Anton Firc</name>
    </author>
    <author>
      <name>Jakub Reš</name>
    </author>
    <author>
      <name>Kamil Malinka</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10908v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10908v1" />
    <published>2026-06-09T14:20:05Z</published>
    <updated>2026-06-09T14:20:05Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vojtěch Staněk et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Overview of ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge</title>
    <author>
      <name>Xueping Zhang</name>
    </author>
    <author>
      <name>Han Yin</name>
    </author>
    <author>
      <name>Yang Xiao</name>
    </author>
    <author>
      <name>Lin Zhang</name>
    </author>
    <author>
      <name>Ting Dang</name>
    </author>
    <author>
      <name>Rohan Kumar Das</name>
    </author>
    <author>
      <name>Ming Li</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10791v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10791v1" />
    <published>2026-06-09T12:42:14Z</published>
    <updated>2026-06-09T12:42:14Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xueping Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for five component-level audio spoofing detection, where speech and environmental sounds may be manipulated independently or jointly. Following the challenge's conclusion, we analyze the final leaderboard and summarize effective design choices from the top-performing submissions. The challenge attracted 94 registrations from 16 countries; after verification of submission requirements and metadata, 13 teams were retained for the final analysis. On the test set, the best system achieved a Macro-F1 score of 0.8775, substantially outperforming the separation-enhanced joint learning baseline (0.6327). Top systems consistently benefited from modular task decomposition, cross-domain self-supervised encoders, targeted data augmentation, and selective ensembling rather than simple model scaling. At the same time, auxiliary EER analyses reveal persistent difficulty in detecting the spoofed environmental component and in generalizing to unseen generators in the test set. This paper reports challenge results and provides insights for future environment-aware deepfake detection research. The CompSpoofV2 dataset and baseline code remain publicly available for reproducibility.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to 2026 ICME workshop&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] ContextCodec: Content-Focused Context Guidance for Ultra-Low Bitrate Speech Coding</title>
    <author>
      <name>Chengbin Liang</name>
    </author>
    <author>
      <name>Wenqi Guo</name>
    </author>
    <author>
      <name>Hao Cao</name>
    </author>
    <author>
      <name>Zhijin Qin</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10591v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10591v1" />
    <published>2026-06-09T08:55:47Z</published>
    <updated>2026-06-09T08:55:47Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Chengbin Liang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Neural speech codecs enable low-bitrate speech communication, yet at ultra-low bitrates (&amp;lt; 1000 bps) preserving perceptual quality and intelligibility is challenging. Existing designs often prioritize acoustic details, leaving limited capacity for the core linguistic message under tight bitrate constraints. To address this, we propose ContextCodec, a codec that transmits content-focused context features to explicitly guide reconstruction. ContextCodec adopts a dual-branch encoder that decouples acoustic details from content-focused context. The context branch is trained with a CLIP-style contrastive loss that aligns context features with phoneme indices, reducing paralinguistic leakage. During decoding, these features are injected at each decoding stage for explicit guidance. In addition, we introduce a lightweight autoregressive latent refinement module. Experiments show a strong quality-intelligibility trade-off down to 500 bps, with an RTF of 0.4886 on a typical mobile CPU.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at Interspeech 2026. 6 pages, 2 figures, 5 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM Framework: A Case Study of Multi-Domain Child-Adult ASR</title>
    <author>
      <name>Mohan Shi</name>
    </author>
    <author>
      <name>Kaiyuan Zhang</name>
    </author>
    <author>
      <name>Zilai Wang</name>
    </author>
    <author>
      <name>Natarajan Balaji Shankar</name>
    </author>
    <author>
      <name>Eray Eren</name>
    </author>
    <author>
      <name>Abeer Alwan</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10454v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10454v1" />
    <published>2026-06-09T06:02:31Z</published>
    <updated>2026-06-09T06:02:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Mohan Shi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling</title>
    <author>
      <name>Guodong Lin</name>
    </author>
    <author>
      <name>Ziqi Chen</name>
    </author>
    <author>
      <name>Yuxiang Fu</name>
    </author>
    <author>
      <name>Ke Li</name>
    </author>
    <author>
      <name>Wei-Qiang Zhang</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10439v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10439v1" />
    <published>2026-06-09T05:35:31Z</published>
    <updated>2026-06-09T05:35:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Guodong Lin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by ICASSP 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation</title>
    <author>
      <name>Xuanchen Li</name>
    </author>
    <author>
      <name>Tianrui Wang</name>
    </author>
    <author>
      <name>Yuheng Lu</name>
    </author>
    <author>
      <name>Zikang Huang</name>
    </author>
    <author>
      <name>Yu Jiang</name>
    </author>
    <author>
      <name>Chenghan Lin</name>
    </author>
    <author>
      <name>Chenrui Cui</name>
    </author>
    <author>
      <name>Ziyang Ma</name>
    </author>
    <author>
      <name>Xingyu Ma</name>
    </author>
    <author>
      <name>Chunyu Qiang</name>
    </author>
    <author>
      <name>Guochen Yu</name>
    </author>
    <author>
      <name>Xie Chen</name>
    </author>
    <author>
      <name>Longbiao Wang</name>
    </author>
    <author>
      <name>Jianwu Dang</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10368v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10368v1" />
    <published>2026-06-09T03:27:30Z</published>
    <updated>2026-06-09T03:27:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xuanchen Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause, a close distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>[cs.SD] KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyword Spotting</title>
    <author>
      <name>Jin Li</name>
    </author>
    <author>
      <name>Wenbin Jiang</name>
    </author>
    <author>
      <name>Ji Hu</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10365v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10365v1" />
    <published>2026-06-09T03:24:24Z</published>
    <updated>2026-06-09T03:24:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jin Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal framework that leverages connectionist temporal classification (CTC)-guided keyframe selection. Specifically, we exploit the peaky posterior distributions of CTC to identify high-confidence phoneme frames, enabling precise alignment across audio, phoneme, and text modalities. These keyframes are then fused with full-utterance representations through cross-attention to capture both local discriminative cues and global contextual information. On LibriPhrase, KFC-KWS achieves the best-balanced performance (98.73% AUC) and substantially outperforms advanced baselines on the challenging hard subset (97.65% AUC and 7.75% EER), demonstrating its effectiveness in discriminating between highly confusable keywords.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning</title>
    <author>
      <name>Khanh Le</name>
    </author>
    <author>
      <name>Kiet Anh Hoang</name>
    </author>
    <author>
      <name>Bao Nguyen</name>
    </author>
    <author>
      <name>Duy Vo</name>
    </author>
    <author>
      <name>Dung Vo</name>
    </author>
    <author>
      <name>Thai Tran</name>
    </author>
    <author>
      <name>Linh Pham</name>
    </author>
    <author>
      <name>Khoa D Doan</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10360v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10360v1" />
    <published>2026-06-09T03:21:40Z</published>
    <updated>2026-06-09T03:21:40Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Khanh Le et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present ViP-VL, an efficient Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning. To bridge the gap between high-resolution audio and efficient processing, ViP-VL incorporates Acoustic Stacking and Receptive Field Alignment to enable a synchronized 8x subsampling rate within the ChunkFormer architecture, while further enhancing representation robustness through a specialized Mask Selection Strategy during pretraining on the BEST-RQ framework. Pretrained on 17,000 hours of unlabeled Vietnamese speech, our model establishes new state-of-the-art results across four major downstream tasks: Automatic Speech Recognition, Speech Emotion Recognition, Dialect Classification, and Speaker Verification. To facilitate future research and the development of high-performance Vietnamese speech technologies, we publicly release our pretrained weights and implementation at github.com/khanld/chunkformer.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;INTERSPEECH 2026, 6 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] SSL-GMMVC: Interpretable Voice Conversion via Locally Linear GMM Transforms in Self-Supervised Representation Space</title>
    <author>
      <name>Tomoya Tanabu</name>
    </author>
    <author>
      <name>Hiroshi Nishijima</name>
    </author>
    <author>
      <name>Daisuke Saito</name>
    </author>
    <author>
      <name>Nobuaki Minematsu</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10317v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10317v1" />
    <published>2026-06-09T02:14:11Z</published>
    <updated>2026-06-09T02:14:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tomoya Tanabu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear transformations that adapt to heterogeneous feature-space structure while remaining analytically tractable. Through objective and subjective evaluations, we show that SSL-GMMVC improves speaker similarity with comparable intelligibility and naturalness, and that even a constrained covariance variant surpasses a deep learning baseline as the number of mixture components increases. Further analyses link component selection to phonetic structure and reveal interpretable scaling and rotation in the learned transforms. These findings highlight SSL-GMMVC as an effective, analyzable framework for voice conversion.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling</title>
    <author>
      <name>Zhuoyan Tao</name>
    </author>
    <author>
      <name>Jiatong Shi</name>
    </author>
    <author>
      <name>Hye-jin Shim</name>
    </author>
    <author>
      <name>Shinji Watanabe</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10233v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10233v1" />
    <published>2026-06-08T22:46:30Z</published>
    <updated>2026-06-08T22:46:30Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhuoyan Tao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulating incremental assessment as a multi-resolution autoregressive task. It models chunk- and utterance-level quality within a single decoder using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals a 4-6 s effective perceptual context horizon. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing</title>
    <author>
      <name>Awais Khan</name>
    </author>
    <author>
      <name>Kutub Uddin</name>
    </author>
    <author>
      <name>Khalid Malik</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10223v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10223v1" />
    <published>2026-06-08T22:22:48Z</published>
    <updated>2026-06-08T22:22:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Awais Khan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>[eess.AS] DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment</title>
    <author>
      <name>Chien-Chun Wang</name>
    </author>
    <author>
      <name>Hung-Shin Lee</name>
    </author>
    <author>
      <name>Hsin-Min Wang</name>
    </author>
    <author>
      <name>Berlin Chen</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.10010v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.10010v1" />
    <published>2026-06-08T18:01:20Z</published>
    <updated>2026-06-08T18:01:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Chien-Chun Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman&amp;#x27;s rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to IEEE Signal Processing Letters (SPL)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration</title>
    <author>
      <name>Björn Þór Jónsson</name>
    </author>
    <author>
      <name>Çağrı Erdem</name>
    </author>
    <author>
      <name>Stefano Fasciani</name>
    </author>
    <author>
      <name>Kyrre Glette</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09780v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09780v1" />
    <published>2026-06-08T17:40:09Z</published>
    <updated>2026-06-08T17:40:09Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Björn Þór Jónsson et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This study addresses the challenges composers and sound designers face in creating and refining tools to achieve their musical goals. Using evolutionary processes to promote diversity and foster serendipitous discoveries, we automate the search through uncharted sonic spaces for sound discovery, arguing that diversity-promoting algorithms can bridge the gap between the theoretical realisation and practical accessibility of sounds. We describe a system for generative sound synthesis combining Quality Diversity (QD) algorithms with a supervised discriminative model, inspired by the Innovation Engine algorithm, and explore different configurations and the interplay between the chosen synthesis approach and the discriminative model. We examine the interaction between Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs, introducing a novel approach that uses multiple specialised CPPNs for different frequency ranges; this yields simpler networks while maintaining performance comparable to single-CPPN setups. We also investigate evolutionary stepping stones by analysing goal switches between musical and non-musical contexts, revealing how lineages traverse unlikely paths to current elites. Expanding the behaviour space of a previous study to include various sound durations, we uncover specialisation within temporal niches. Results indicate that CPPN and DSP graphs coupled with a Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) and a deep learning classifier can generate a substantial variety of synthetic sounds, diverse and innovative across temporal and contextual dimensions. We present the generated sound objects through an online explorer and as rendered sound files, and, in the context of music composition, an experimental application that showcases their creative potential across various durations and contexts.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;This is an extended version of the previously published conference paper &amp;quot;Towards Sound Innovation Engines Using Pattern-Producing Networks and Audio Graphs&amp;quot;: https://doi.org/10.1007/978-3-031-56992-0_14&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study</title>
    <author>
      <name>Zhu Li</name>
    </author>
    <author>
      <name>Shekhar Nayak</name>
    </author>
    <author>
      <name>Matt Coler</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09717v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09717v1" />
    <published>2026-06-08T16:43:37Z</published>
    <updated>2026-06-08T16:43:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhu Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Prosody plays a central role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] RespiraMFM: A Multimodal Foundation Model with Contrastive Audio-Language Alignment for Respiratory Disease Identification</title>
    <author>
      <name>Shakhrul Iman Siam</name>
    </author>
    <author>
      <name>Tiantian Feng</name>
    </author>
    <author>
      <name>Jiankun Zhang</name>
    </author>
    <author>
      <name>Shrikanth Narayanan</name>
    </author>
    <author>
      <name>Mi Zhang</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09966v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09966v1" />
    <published>2026-06-08T16:29:59Z</published>
    <updated>2026-06-08T16:29:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shakhrul Iman Siam et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Respiratory diseases remain a leading cause of global mortality, where timely and accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approaches often suffer from limited generalizability and diagnostic precision. In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. We introduce an effective contrastive alignment strategy for audio-text multimodal integration, allowing the model to learn better cross-modal representations between respiratory sounds and corresponding textual clinical information. We evaluate RespiraMFM across five major respiratory diseases using seven real-world datasets in both supervised fine-tuning and zero-shot settings, achieving a 9.15% improvement in AUROC on supervised tasks and a 20.98% gain on zero-shot tasks over existing baselines. These findings underscore the potential of our framework to advance early diagnosis and improve clinical decision-making in respiratory disease management.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ACL 2026 Main Conference&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading</title>
    <author>
      <name>Eder del Blanco</name>
    </author>
    <author>
      <name>David Gimeno-Gómez</name>
    </author>
    <author>
      <name>Eva Navas</name>
    </author>
    <author>
      <name>Carlos-D. Martínez-Hinarejos</name>
    </author>
    <author>
      <name>Inma Hernáez</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09667v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09667v1" />
    <published>2026-06-08T15:50:51Z</published>
    <updated>2026-06-08T15:50:51Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Eder del Blanco et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results show not only that masking strategies are critical for these performance gains and for robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations when a modality is absent. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;12 pages, 7 figures and 6 tables. Submitted to Transactions on Audio, Speech and Language Processing&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech</title>
    <author>
      <name>Vadim Popov</name>
    </author>
    <author>
      <name>Wenju Gu</name>
    </author>
    <author>
      <name>Tasnima Sadekova</name>
    </author>
    <author>
      <name>Georgii Aparin</name>
    </author>
    <author>
      <name>Assel Yermekova</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09962v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09962v1" />
    <published>2026-06-08T14:41:24Z</published>
    <updated>2026-06-08T14:41:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vadim Popov et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Continuous diffusion for categorical data is a framework in the diffusion family that aims to generate discrete data. Scientific interest in such models has been steadily increasing as researchers pursue the challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study the properties of the latent space structure corresponding to discrete tokens, expressed in terms of the Kullback-Leibler divergence on diffusion path measures and the accuracy of correct-token prediction by the optimally trained diffusion model. We find that the FSQ tokenization scheme has a latent space structure whose properties make it best suited for continuous diffusion for categorical data, as verified through rigorous theoretical analysis and numerical experiments. To validate our findings in a real-life scenario, we train several text-to-speech diffusion models that use speech tokens as intermediate acoustic features and show that the one based on FSQ tokens indeed performs best; moreover, it outperforms its strong LLM-based counterpart while being significantly smaller and faster.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages</title>
    <author>
      <name>David Guzmán</name>
    </author>
    <author>
      <name>Luel Hagos Beyene</name>
    </author>
    <author>
      <name>Jesujoba Oluwadara Alabi</name>
    </author>
    <author>
      <name>Yejin Jeon</name>
    </author>
    <author>
      <name>Dietrich Klakow</name>
    </author>
    <author>
      <name>David Ifeoluwa Adelani</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09553v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09553v1" />
    <published>2026-06-08T14:30:48Z</published>
    <updated>2026-06-08T14:30:48Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;David Guzmán et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world&amp;#x27;s languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages</title>
    <author>
      <name>Chowdam Venkata Kumar</name>
    </author>
    <author>
      <name>Kumud Tripathi</name>
    </author>
    <author>
      <name>Pankaj Wasnik</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09535v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09535v1" />
    <published>2026-06-08T14:18:51Z</published>
    <updated>2026-06-08T14:18:51Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Chowdam Venkata Kumar et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at INTERSPEECH 2026, 5 pages, 1 figure, 5 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation</title>
    <author>
      <name>Hanke Xie</name>
    </author>
    <author>
      <name>Xiaming Ren</name>
    </author>
    <author>
      <name>Dake Guo</name>
    </author>
    <author>
      <name>Ruonan You</name>
    </author>
    <author>
      <name>Wenhao Li</name>
    </author>
    <author>
      <name>Jingbin Hu</name>
    </author>
    <author>
      <name>Guobin Ma</name>
    </author>
    <author>
      <name>Huakang Chen</name>
    </author>
    <author>
      <name>Kejie Xu</name>
    </author>
    <author>
      <name>Rui Huang</name>
    </author>
    <author>
      <name>Weiguo Tan</name>
    </author>
    <author>
      <name>Xianrong Wang</name>
    </author>
    <author>
      <name>Lei Xie</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09141v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09141v2" />
    <published>2026-06-08T07:39:26Z</published>
    <updated>2026-06-09T03:52:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hanke Xie et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. These systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These methods typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source and low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS substantially reduces First-Packet Latency to 325 ms compared to robust streaming baselines, while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion</title>
    <author>
      <name>Guobin Ma</name>
    </author>
    <author>
      <name>Yuxuan Xia</name>
    </author>
    <author>
      <name>Yuepeng Jiang</name>
    </author>
    <author>
      <name>Dake Guo</name>
    </author>
    <author>
      <name>Hanke Xie</name>
    </author>
    <author>
      <name>Jingbin Hu</name>
    </author>
    <author>
      <name>Yanbo Wang</name>
    </author>
    <author>
      <name>Lei Xie</name>
    </author>
    <author>
      <name>Pengcheng Zhu</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09050v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09050v1" />
    <published>2026-06-08T05:39:23Z</published>
    <updated>2026-06-08T05:39:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Guobin Ma et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] BareWave: Waveform-Native Flow-Matching Text-to-Speech</title>
    <author>
      <name>Wei Fan</name>
    </author>
    <author>
      <name>Chao-Hong Tan</name>
    </author>
    <author>
      <name>Qian Chen</name>
    </author>
    <author>
      <name>Wen Wang</name>
    </author>
    <author>
      <name>Xiangang Li</name>
    </author>
    <author>
      <name>Kejiang Chen</name>
    </author>
    <author>
      <name>Weiming Zhang</name>
    </author>
    <author>
      <name>Nenghai Yu</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09048v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09048v1" />
    <published>2026-06-08T05:36:42Z</published>
    <updated>2026-06-08T05:36:42Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wei Fan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Under Review&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data</title>
    <author>
      <name>Moshe Mandel</name>
    </author>
    <author>
      <name>Shlomo E. Chazan</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08843v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08843v1" />
    <published>2026-06-07T21:25:14Z</published>
    <updated>2026-06-07T21:25:14Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Moshe Mandel, Shlomo E. Chazan&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic inputs, while real target audio provides ground-truth outputs, forming a synthetic-to-real training paradigm that naturally supports multilingual data without requiring parallel corpora or explicit alignment. To ensure consistent target-speaker identity, we incorporate a speaker loss derived from a pretrained speaker verification model. Experiments across multiple languages demonstrate that the proposed approach achieves high naturalness and strong speaker similarity, outperforming competitive VC baselines, despite being trained exclusively on English data. Samples can be accessed at: https://palindromic-vc.github.io.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding</title>
    <author>
      <name>Matteo Spanio</name>
    </author>
    <author>
      <name>Mohammad Torabi</name>
    </author>
    <author>
      <name>Andrea Poltronieri</name>
    </author>
    <author>
      <name>Antonio Rodà</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08722v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08722v1" />
    <published>2026-06-07T16:32:59Z</published>
    <updated>2026-06-07T16:32:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Matteo Spanio et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at Ital-IA 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Probing Token Spaces under Generator Shift in AI-Generated Music Detection</title>
    <author>
      <name>Joonyong Park</name>
    </author>
    <author>
      <name>Jungwoo Kim</name>
    </author>
    <author>
      <name>Junyoung Koh</name>
    </author>
    <author>
      <name>Yuki Saito</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08663v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08663v1" />
    <published>2026-06-07T15:08:19Z</published>
    <updated>2026-06-07T15:08:19Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Joonyong Park et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;AI-generated music detectors can appear robust on standard benchmark splits, yet their deployments require transfer to generator sources absent during training. We study this problem with source-restricted evaluation on MoM-open, an open reconstruction of MoM-CLAM that replaces the non-redistributable real corpus with FMA and MTG-Jamendo while preserving the fake-generator protocol. To isolate the role of representation, we introduce CoMoE, a compact fixed classifier for comparing heterogeneous audio token spaces while keeping the downstream architecture and training recipe unchanged. Experiments show that standard and real-source-restricted splits are nearly saturated, whereas fake-source restriction exposes large differences between token spaces: X-Codec tokens are strongest when training on Udio alone, while MERT-derived tokens are stronger when training on Suno-v3.5 alone. These results suggest that codec-style discrete token spaces should be treated as a primary experimental axis under generator shift in AI-generated music detection. Our code and data are available at https://github.com/MAAP-LAB/CoMoE.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ICML 2026 ML4Audio workshop&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning</title>
    <author>
      <name>Xiangyu Zhao</name>
    </author>
    <author>
      <name>Junyu Yan</name>
    </author>
    <author>
      <name>Yaling Shen</name>
    </author>
    <author>
      <name>Zimu Wang</name>
    </author>
    <author>
      <name>Yiwen Jiang</name>
    </author>
    <author>
      <name>Stephanie Fong</name>
    </author>
    <author>
      <name>Qingyang Xu</name>
    </author>
    <author>
      <name>Jiahe Liu</name>
    </author>
    <author>
      <name>Dominic Dwyer</name>
    </author>
    <author>
      <name>Zongyuan Ge</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.09925v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.09925v1" />
    <published>2026-06-07T12:24:18Z</published>
    <updated>2026-06-07T12:24:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xiangyu Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large audio-language models (LALMs) increasingly use explicit reasoning traces for complex audio understanding, yet the evaluation of reasoning quality remains underexplored. Although process-level benchmarks for process reward models (PRMs) have advanced reasoning evaluation in text and multi-modal domains, comparable evaluation for audio reasoning remains limited. In this paper, we present AudioProcessBench, a comprehensive benchmark for step-level process error identification in audio reasoning. AudioProcessBench contains diverse reasoning traces generated by 6 audio and omni language models. Each trace is segmented into discrete reasoning steps and annotated with binary step correctness and fine-grained error types. Our benchmark evaluates models under three complementary paradigms: (1) step correctness identification, (2) error-type-conditioned detection for diagnosing audio-specific verifier capacities, and (3) chain-level aggregation, where verifiers select or aggregate among multiple reasoning traces for the same question. This design enables a systematic analysis of whether current models can detect process errors, whether their weaknesses differ across audio-specific error types, and whether process verification translates into improved answer selection. AudioProcessBench provides a testbed for future research on audio reasoning verifiers, process reward models, and reliable omni-modal reasoning.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching</title>
    <author>
      <name>Yike Zhu</name>
    </author>
    <author>
      <name>Ziqian Wang</name>
    </author>
    <author>
      <name>Zikai Liu</name>
    </author>
    <author>
      <name>Xingchen Li</name>
    </author>
    <author>
      <name>Zhuangqi Chen</name>
    </author>
    <author>
      <name>Xianjun Xia</name>
    </author>
    <author>
      <name>Chuanzeng Huang</name>
    </author>
    <author>
      <name>Lei Xie</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08580v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08580v1" />
    <published>2026-06-07T11:28:32Z</published>
    <updated>2026-06-07T11:28:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yike Zhu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints</title>
    <author>
      <name>Vinh-Thuan Ly</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08425v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08425v1" />
    <published>2026-06-07T02:50:24Z</published>
    <updated>2026-06-07T02:50:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Vinh-Thuan Ly&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.app&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] FXplorer: A Map-Based Interface for Exploratory Audio Effect Design</title>
    <author>
      <name>Annie Chu</name>
    </author>
    <author>
      <name>Jason Brent Smith</name>
    </author>
    <author>
      <name>Bryan Pardo</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08286v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08286v1" />
    <published>2026-06-06T18:14:41Z</published>
    <updated>2026-06-06T18:14:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Annie Chu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Audio effects (FX) shape sound in contemporary music practice. However, most interfaces present them as discrete modules and parameters that favor targeted adjustment over exploratory listening. This separation can make it difficult to build intuition about the broader space of possible transformations or to move fluidly between searching and refinement. We present FXplorer, an interface that organizes audio effects within a perceptually informed 2D space, allowing sound transformations to be browsed as a continuous landscape rather than as isolated presets. By combining established spatial interaction approaches and interpretable DAW-style controls with recent embedding-based machine learning methods for similarity and semantic search, the system brings exploration and parameter refinement into a single workspace. FXplorer supports composition, production, or performance by allowing users to edit and interpolate between effect presets interactively.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to NIME 2026. Project page: https://anniejchu.github.io/fxplorer/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion</title>
    <author>
      <name>Rashini Liyanarachchi</name>
    </author>
    <author>
      <name>Rachael Mackay</name>
    </author>
    <author>
      <name>Alison Short</name>
    </author>
    <author>
      <name>Aditya Joshi</name>
    </author>
    <author>
      <name>Erik Meijering</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08210v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08210v1" />
    <published>2026-06-06T14:54:44Z</published>
    <updated>2026-06-06T14:54:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Rashini Liyanarachchi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental &amp;quot;searching&amp;quot; behaviour, offering a more robust and interpretable tool for early clinical intervention.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at INTERSPEECH 2026 (Main)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference</title>
    <author>
      <name>Hugo Leguillier</name>
    </author>
    <author>
      <name>Driss Matrouf</name>
    </author>
    <author>
      <name>Guillaume Lechien</name>
    </author>
    <author>
      <name>Mickael Rouvier</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08087v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08087v1" />
    <published>2026-06-06T10:23:18Z</published>
    <updated>2026-06-06T10:23:18Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hugo Leguillier et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Deep-learning speaker verification (SV) increasingly relies on deep neural network backbones, whose environmental impact remains largely undocumented. In this paper, we conduct an evaluation of ResNet architectures trained on VoxCeleb2, varying depth, channel width, and stage distribution, and measure energy consumption and carbon footprint using node-level sensors. Results show a clear point of diminishing returns: deeper or wider models bring only marginal accuracy gains while energy consumption grows steeply. In contrast, mid-sized networks such as ResNet-50 and stage-concentrated variants achieve favorable trade-offs between performance and environmental impact. These findings provide actionable guidelines for designing energy-efficient SV systems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Speaker Odyssey 2026 Lisbon&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation</title>
    <author>
      <name>Hugo Leguillier</name>
    </author>
    <author>
      <name>Driss Matrouf</name>
    </author>
    <author>
      <name>Guillaume Lechien</name>
    </author>
    <author>
      <name>Mickael Rouvier</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08078v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08078v1" />
    <published>2026-06-06T09:55:37Z</published>
    <updated>2026-06-06T09:55:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hugo Leguillier et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Although low-bit quantization provides a practical means to deploy speaker verification on resource-constrained devices, its effects on speaker verification performance remain poorly understood. In this paper, we study uniform K-means quantization-aware training of ResNet-36 and ResNet-200 through joint layer-wise and score-level analyses. Our layer-wise analysis highlights fragile components and shows that score degradation is not fully explained by weight distortion alone. We identify a clear knee point at 2 bits, with larger score drift and harmful decision flips concentrated near the FP32 threshold. Our score-level analysis reveals where and how score errors emerge under extreme quantization. Building on these findings, we propose a calibrated multi-precision cascade that resolves most trials at 2 bits and escalates only ambiguous cases, achieving performance close to FP32 while preserving the efficiency benefits of low-bit inference with substantially lower compute and memory costs.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at Speaker Odyssey 2026 Lisbon&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Exploring the Scale and Diversity of Speech Anti-spoofing Datasets: Experiments and Analysis</title>
    <author>
      <name>Zhuolin Yi</name>
    </author>
    <author>
      <name>Jun Xue</name>
    </author>
    <author>
      <name>Yanzhen Ren</name>
    </author>
    <author>
      <name>Yihuan Huang</name>
    </author>
    <author>
      <name>Yi Chai</name>
    </author>
    <author>
      <name>Daixian Li</name>
    </author>
    <author>
      <name>Guanxiang Feng</name>
    </author>
    <author>
      <name>Jiajun Liu</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.08038v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.08038v1" />
    <published>2026-06-06T07:58:02Z</published>
    <updated>2026-06-06T07:58:02Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhuolin Yi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The scale of speech anti-spoofing datasets has grown exponentially over the past decade, driven by the assumption that larger data leads to better performance. However, it remains unclear whether indiscriminate scaling commensurately improves model generalization. This study challenges the &amp;quot;scale-first&amp;quot; paradigm by decoupling the impacts of training data scale versus diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Expanding data scale excessively under fixed generation methods yields negligible returns and may even degrade cross-domain generalization due to overfitting. (2) Diversity outweighs scale. A smaller composite training set featuring diverse attacks significantly outperforms larger-scale datasets with limited diversity in cross-dataset evaluations. We conclude that future dataset construction should prioritize the diversity of generation methods over scale to effectively enhance model generalization.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech</title>
    <author>
      <name>Xuanjun Chen</name>
    </author>
    <author>
      <name>Yun-Shing Wu</name>
    </author>
    <author>
      <name>Wei-Chung Lu</name>
    </author>
    <author>
      <name>Claire Lin</name>
    </author>
    <author>
      <name>Haibin Wu</name>
    </author>
    <author>
      <name>Hung-yi Lee</name>
    </author>
    <author>
      <name>Jyh-Shing Roger Jang</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07494v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07494v1" />
    <published>2026-06-05T17:48:46Z</published>
    <updated>2026-06-05T17:48:46Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Xuanjun Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec resynthesized speech (CoRS) as proxy data improves performance, it often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates &amp;quot;in-the-wild&amp;quot; variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval (from CodecFake+) dataset, featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Work in progress&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders</title>
    <author>
      <name>Georgii Aparin</name>
    </author>
    <author>
      <name>Vadim Popov</name>
    </author>
    <author>
      <name>Tasnima Sadekova</name>
    </author>
    <author>
      <name>Assel Yermekova</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07473v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07473v1" />
    <published>2026-06-05T17:26:23Z</published>
    <updated>2026-06-05T17:26:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Georgii Aparin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Whisper, a widely adopted ASR model, is known to suffer from hallucinations: coherent transcriptions generated for non-speech audio that are entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper&amp;#x27;s internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement</title>
    <author>
      <name>Yifan Duan</name>
    </author>
    <author>
      <name>Qixiang Xu</name>
    </author>
    <author>
      <name>Hengtao Wu</name>
    </author>
    <author>
      <name>Zhanxun Liu</name>
    </author>
    <author>
      <name>Wenhao Guan</name>
    </author>
    <author>
      <name>Junxi Liu</name>
    </author>
    <author>
      <name>Ziyang Ma</name>
    </author>
    <author>
      <name>Kelu Xu</name>
    </author>
    <author>
      <name>Xie Chen</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07397v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07397v1" />
    <published>2026-06-05T15:38:08Z</published>
    <updated>2026-06-05T15:38:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yifan Duan et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;In recent years, audio generation has made significant progress in tasks such as text-to-speech (TTS), text-to-audio (TTA) and text-to-music (TTM). However, generating long-form and controllable audio from complex audio scene descriptions remains a significant challenge, as such scenes often require coordinated speech, sound effects, music, songs, temporal structure, and post-production. In this work, we introduce Audio-Oscar, a multi-agent framework for generating audio from complex descriptions. Audio-Oscar coordinates a set of specialist agents, each responsible for a different aspect of the audio scene, including character modeling and voice design, speech generation, fine-grained timeline planning, model selection, non-speech generation, and audio post-production. Audio-Oscar further incorporates feedback-driven refinement. In addition, to address the lack of suitable benchmarks for evaluating audio generation from complex audio scene descriptions, we construct ASG-Bench, an Audio Scene Generation Benchmark containing both scene descriptions paired with reference audio and text-only scene descriptions. Each scene is annotated with target audio events and temporal statements to evaluate whether the generated audio faithfully realizes the required scene content and temporal structure. Experimental results show that Audio-Oscar can effectively generate audio that matches complex scene descriptions. Project samples are available at https://audiooscar.github.io/. Our code is available at https://github.com/ziye26/Audio-Oscar.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast</title>
    <author>
      <name>Zhengkun Ge</name>
    </author>
    <author>
      <name>Xiaoqian Liu</name>
    </author>
    <author>
      <name>Haoran Zhang</name>
    </author>
    <author>
      <name>Yuan Ge</name>
    </author>
    <author>
      <name>Junxiang Zhang</name>
    </author>
    <author>
      <name>Zhengtao Yu</name>
    </author>
    <author>
      <name>Jingbo Zhu</name>
    </author>
    <author>
      <name>Tong Xiao</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07356v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07356v1" />
    <published>2026-06-05T15:04:22Z</published>
    <updated>2026-06-05T15:04:22Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhengkun Ge et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source-to-target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling</title>
    <author>
      <name>Jinju Lee</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07334v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07334v1" />
    <published>2026-06-05T14:49:24Z</published>
    <updated>2026-06-05T14:49:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jinju Lee&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Harmony is a compact symbolic layer where mathematical pitch relations, acoustic consonance, and musical convention meet. This report treats chord-symbol sequences not as a complete representation of music, but as an interpretable, controllable time series for genre-local harmonic modeling. Starting from a frozen pop-jazz Music Transformer checkpoint, I evaluate how far small adaptation interfaces can extend the model to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&amp;amp;B/soul, and rock. The main evaluation compares LoRA, IA3, BitFit, prefix tuning, and full fine-tuning over 11 genres and 3 seeds, a complete 165-cell grid. All five methods improve over the frozen base on held-out chord prediction, with macro gains from +2.89 to +3.61 points; LoRA and IA3 score highest, but Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this: when genres are sub-sampled to a common corpus size, IA3 stays on top but LoRA&amp;#x27;s full-data edge disappears and it falls to last, indicating the small gaps are partly data-driven. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, suggesting much of the effect comes from lightweight conditioning over a reusable harmonic base rather than one particular adapter family. Additional diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation, chord-only genre classification, generated-output statistics, real-song evaluation, and duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. The report therefore avoids claims about perceived genre authenticity or full musical quality, which require controlled listener or musician evaluation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;16 pages, 4 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition</title>
    <author>
      <name>Iosif Tsangko</name>
    </author>
    <author>
      <name>Andreas Triantafyllopoulos</name>
    </author>
    <author>
      <name>Björn W. Schuller</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07309v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07309v1" />
    <published>2026-06-05T14:26:06Z</published>
    <updated>2026-06-05T14:26:06Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Iosif Tsangko et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;6 pages, 3 figures, 3 tables&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.LG] Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path</title>
    <author>
      <name>Thomas Sesmat</name>
    </author>
    <author>
      <name>Gabriel Meseguer-Brocal</name>
    </author>
    <author>
      <name>Geoffroy Peeters</name>
    </author>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07271v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07271v1" />
    <published>2026-06-05T13:46:37Z</published>
    <updated>2026-06-05T13:46:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Thomas Sesmat et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond verbatim reproduction, models can encode subtler traces of their training data that never surface in their outputs yet remain exploitable. We study this regime for Rectified Flows, which are increasingly used in deployed generative systems. We analyse the interpolation path $X_λ = (1-λ)X_0 + λX_1$ that defines the Rectified Flow training. We show that a gap exists between the reconstruction of train and test data that follows a bell-shaped curve over $λ$, which accumulates during training, while the validation metrics remain stable. The signal has a maximum whose location we derive in closed form under Gaussian assumptions. We validate these predictions on both audio and images and show that the bell-shaped structure is universal, while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we exploit this specific $λ$-resolved structure to perform a Membership Inference Attack, distinguishing members of the training set from non-members.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;ICML 2026 article, 9 main pages and 25 with annexes, 11 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] Assessing True Generalisability of Audio-Visual Speech Recognisers</title>
    <author>
      <name>Zhaofeng Lin</name>
    </author>
    <author>
      <name>Stavros Petridis</name>
    </author>
    <author>
      <name>Maja Pantic</name>
    </author>
    <author>
      <name>Naomi Harte</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07259v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07259v1" />
    <published>2026-06-05T13:35:10Z</published>
    <updated>2026-06-05T13:35:10Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhaofeng Lin et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns of adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the massive MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a universal performance collapse, proving that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a profound lexical bias, expose distinct error patterns, and surprisingly reveal that audio-visual performance even lags behind audio-only settings. We release our matched test set for future benchmarking.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026 Long paper track. 9 pages, 4 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026</title>
    <author>
      <name>Seymanur Akti</name>
    </author>
    <author>
      <name>Alexander Waibel</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07240v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07240v1" />
    <published>2026-06-05T13:09:21Z</published>
    <updated>2026-06-05T13:09:21Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Seymanur Akti, Alexander Waibel&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] MMAE: A Massive Multitask Audio Editing Benchmark</title>
    <author>
      <name>Ziyang Ma</name>
    </author>
    <author>
      <name>Ruiqi Yan</name>
    </author>
    <author>
      <name>Ruiyang Xu</name>
    </author>
    <author>
      <name>Jie Fang</name>
    </author>
    <author>
      <name>Zhikang Niu</name>
    </author>
    <author>
      <name>Yi-Wen Chao</name>
    </author>
    <author>
      <name>Wenming Tu</name>
    </author>
    <author>
      <name>Tianrui Wang</name>
    </author>
    <author>
      <name> Auden</name>
    </author>
    <author>
      <name>Qi Chen</name>
    </author>
    <author>
      <name>Wenxi Chen</name>
    </author>
    <author>
      <name>Jiaying Chi</name>
    </author>
    <author>
      <name>Yanru Huo</name>
    </author>
    <author>
      <name>Zixuan Jiang</name>
    </author>
    <author>
      <name>Xiquan Li</name>
    </author>
    <author>
      <name>Yalin Li</name>
    </author>
    <author>
      <name>Junxi Liu</name>
    </author>
    <author>
      <name>Minghao Liu</name>
    </author>
    <author>
      <name>Binghao Qiang</name>
    </author>
    <author>
      <name>Yijia Shan</name>
    </author>
    <author>
      <name>Zheshu Song</name>
    </author>
    <author>
      <name>Tian Tan</name>
    </author>
    <author>
      <name>Zixiang Wang</name>
    </author>
    <author>
      <name>Zeyu Xie</name>
    </author>
    <author>
      <name>Zhifei Xie</name>
    </author>
    <author>
      <name>Xiaoyu Xing</name>
    </author>
    <author>
      <name>Qixiang Xu</name>
    </author>
    <author>
      <name>Chen Yang</name>
    </author>
    <author>
      <name>Guanrou Yang</name>
    </author>
    <author>
      <name>Shan Yang</name>
    </author>
    <author>
      <name>Yifan Yang</name>
    </author>
    <author>
      <name>Steve Yves</name>
    </author>
    <author>
      <name>Haotian Zhang</name>
    </author>
    <author>
      <name>Haina Zhu</name>
    </author>
    <author>
      <name>Kai Yu</name>
    </author>
    <author>
      <name>Liefeng Bo</name>
    </author>
    <author>
      <name>Eng-Siong Chng</name>
    </author>
    <author>
      <name>Xie Chen</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07229v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07229v1" />
    <published>2026-06-05T12:52:41Z</published>
    <updated>2026-06-05T12:52:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ziyang Ma et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Open-Source at https://github.com/ddlBoJack/MMAE&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] A Large-Scale Per-Speaker Analysis of Re-identification Risk in Speech Anonymization</title>
    <author>
      <name>Orane Dufour</name>
    </author>
    <author>
      <name>Paul Magron</name>
    </author>
    <author>
      <name>Mickael Rouvier</name>
    </author>
    <author>
      <name>Emmanuel Vincent</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07210v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07210v1" />
    <published>2026-06-05T12:21:25Z</published>
    <updated>2026-06-05T12:21:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Orane Dufour et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Speech anonymization is commonly evaluated using average-case metrics such as the equal error rate, which can hide large disparities in re-identification risks across individuals. In this paper, we conduct a large-scale per-speaker privacy analysis using a linkability-based metric under a worst-case scenario. Nearly 5,000 speakers are evaluated across multiple anonymization systems, attacker architectures, and conversation lengths. While linkability scores are highly polarized at the speaker level, the sets of easy-to-re-identify and hard-to-re-identify speakers vary substantially across configurations. We show that no single factor explains speaker vulnerability. Instead, the re-identification risk emerges from the interaction between the attacker, the anonymizer, and the amount of available speech. These results challenge the notion of intrinsic speaker-level privacy risks and emphasize the need for evaluation protocols that are explicitly conditioned on the attacker and anonymizer.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] dots.tts Technical Report</title>
    <author>
      <name>Shi Lian</name>
    </author>
    <author>
      <name>Changtao Li</name>
    </author>
    <author>
      <name>Bohan Li</name>
    </author>
    <author>
      <name>Hankun Wang</name>
    </author>
    <author>
      <name>Da Zheng</name>
    </author>
    <author>
      <name>Junfeng Tian</name>
    </author>
    <author>
      <name>Yufeng Ma</name>
    </author>
    <author>
      <name>Colin Zhang</name>
    </author>
    <author>
      <name>Kai Yu</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07080v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07080v1" />
    <published>2026-06-05T09:19:24Z</published>
    <updated>2026-06-05T09:19:24Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shi Lian et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation</title>
    <author>
      <name>Ziyu Zhang</name>
    </author>
    <author>
      <name>Chunyu Qiang</name>
    </author>
    <author>
      <name>Xiaopeng Wang</name>
    </author>
    <author>
      <name>Yuxin Guo</name>
    </author>
    <author>
      <name>Kang Yin</name>
    </author>
    <author>
      <name>Wenjie Tian</name>
    </author>
    <author>
      <name>Jingbin Hu</name>
    </author>
    <author>
      <name>Tianlun Zuo</name>
    </author>
    <author>
      <name>Zhao Guo</name>
    </author>
    <author>
      <name>Teng Ma</name>
    </author>
    <author>
      <name>Yuzhe Liang</name>
    </author>
    <author>
      <name>Chen Zhang</name>
    </author>
    <author>
      <name>Lei Xie</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.07015v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.07015v1" />
    <published>2026-06-05T07:59:17Z</published>
    <updated>2026-06-05T07:59:17Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ziyu Zhang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space transferring speaker representation from SVC to song generation, endowing fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and complementary benefits between them, offering new possibilities for intelligent music production.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] MyGardenBird: A Machine-Learning-Ready Bird Sound Dataset for Twelve Common Malaysian Birds</title>
    <author>
      <name>Muhammad Mun'im Ahmad Zabidi</name>
    </author>
    <author>
      <name>Mohd Yamani Idna Idris</name>
    </author>
    <author>
      <name>Norisma Idris</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06975v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06975v1" />
    <published>2026-06-05T07:07:22Z</published>
    <updated>2026-06-05T07:07:22Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Muhammad Mun&amp;#x27;im Ahmad Zabidi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Bioacoustic datasets from tropical regions remain limited, in part due to the absence of reproducible workflows for aggregating recordings from public archives. We present MyGardenBird, a curated dataset of bird vocalisations representing twelve common species across Peninsular Malaysia and the Indo-Malayan region. Recordings were sourced from Xeno-canto and processed through species-level filtering, manual spectrogram segmentation, and quality control checks. The primary release comprises 7,200 manually validated audio clips (16 kHz, 16-bit PCM mono WAV), balanced at 600 three-second clips per species (6.0 hours total) derived from 1,381 distinct recordings. Metadata includes geospatial coordinates, vocalisation categories, and signal-to-noise ratio (SNR) values (range: 0.83–59.18 dB; mean: 15.80 dB). A supplementary 44.1 kHz version is also provided. To mitigate data leakage, dataset partitions are defined at the source-recording level. Baseline classification experiments using convolutional neural networks on Mel-spectrograms achieved test accuracies of 92–96%, indicating strong interspecies separability. Limitations include reliance on single-annotator curation; however, validation with BirdNET confirmed label consistency. MyGardenBird is openly available at https://doi.org/10.5281/zenodo.20306877 under a CC BY-NC-SA 4.0 licence. Complete preprocessing code accompanies the release to support reproducibility and future expansion.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;17 pages, 9 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models</title>
    <author>
      <name>Zhixian Zhao</name>
    </author>
    <author>
      <name>Shuiyuan Wang</name>
    </author>
    <author>
      <name>Wenjie Tian</name>
    </author>
    <author>
      <name>Jingbin Hu</name>
    </author>
    <author>
      <name>Ziyu Zhang</name>
    </author>
    <author>
      <name>Lei Xie</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06940v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06940v1" />
    <published>2026-06-05T06:11:38Z</published>
    <updated>2026-06-05T06:11:38Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhixian Zhao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM (https://github.com/zxzhao0/CogAudio-LLM), a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a &amp;quot;lexically-identical, multi-emotion&amp;quot; dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] VoxCPM2 Technical Report</title>
    <author>
      <name>Yixuan Zhou</name>
    </author>
    <author>
      <name>Guoyang Zeng</name>
    </author>
    <author>
      <name>Xin Liu</name>
    </author>
    <author>
      <name>Xiang Li</name>
    </author>
    <author>
      <name>Renjie Yu</name>
    </author>
    <author>
      <name>Jiancheng Gui</name>
    </author>
    <author>
      <name>Jiaheng Wu</name>
    </author>
    <author>
      <name>Ziyang Wang</name>
    </author>
    <author>
      <name>Xudong Shen</name>
    </author>
    <author>
      <name>Runchuan Ye</name>
    </author>
    <author>
      <name>Zhisheng Zhang</name>
    </author>
    <author>
      <name>Jiuyang Zhou</name>
    </author>
    <author>
      <name>Bingsong Bai</name>
    </author>
    <author>
      <name>Weiyue Sun</name>
    </author>
    <author>
      <name>Mengyuan Deng</name>
    </author>
    <author>
      <name>Qundong Shi</name>
    </author>
    <author>
      <name>Zhiyong Wu</name>
    </author>
    <author>
      <name>Zhiyuan Liu</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06928v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06928v1" />
    <published>2026-06-05T05:43:15Z</published>
    <updated>2026-06-05T05:43:15Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yixuan Zhou et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present VoxCPM2, a fully open-source multilingual and controllable speech generation foundation model that extends the hierarchical diffusion-autoregressive modeling paradigm of VoxCPM. VoxCPM2 advances the framework in three key dimensions: (i) capability, by unifying 30 languages, 9 Chinese dialects, natural-language voice design, style-controllable voice cloning, and high-fidelity continuation cloning within a single backbone; (ii) quality, through an asymmetric AudioVAE that encodes at 16 kHz and reconstructs at 48 kHz, enabling implicit super-resolution with high encoding efficiency; and (iii) scale, by jointly scaling the model to 2B parameters and the training data to over 2 million hours of multilingual speech. To support these diverse capabilities within one model, we introduce a unified sequence organization that expresses all generation modes through different arrangements of the same input building blocks, allowing joint training under a single set of parameters and objective. VoxCPM2 achieves state-of-the-art or competitive performance on public zero-shot and instruction-following TTS benchmarks. On our internal 30-language evaluation set, it attains an average WER of 1.68%. These results demonstrate that hierarchical continuous-latent modeling, without relying on any external discrete speech tokenizer, offers a viable and powerful foundation for large-scale multilingual and controllable speech generation. The model weights, fine-tuning code, and inference tools are publicly released under the Apache 2.0 license to foster community research and development.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;The technical report of VoxCPM2, a TTS foundation model (GitHub: https://github.com/OpenBMB/VoxCPM)&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Towards Event-Robust Acoustic Scene Classification</title>
    <author>
      <name>Yiqiang Cai</name>
    </author>
    <author>
      <name>Bohan Hu</name>
    </author>
    <author>
      <name>Yu Yang</name>
    </author>
    <author>
      <name>Pengwei Lu</name>
    </author>
    <author>
      <name>Shengchen Li</name>
    </author>
    <author>
      <name>Xi Shao</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06921v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06921v1" />
    <published>2026-06-05T05:35:03Z</published>
    <updated>2026-06-05T05:35:03Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yiqiang Cai et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This paper introduces the Event-Shifted Acoustic Scene (ESAS) dataset, a novel benchmark for evaluating the robustness of Acoustic Scene Classification (ASC) systems against unknown sound events. Existing ASC datasets typically contain recordings of clean and consistent audio, while real-world environments often include diverse and unexpected sound events. To bridge this gap, ESAS simulates real-world acoustic variability by injecting foreground sound events into background scenes with the assistance of large language models. In this work, we present the construction methodology, dataset statistics, and evaluation protocols. Furthermore, a comprehensive evaluation of state-of-the-art ASC systems is conducted using the ESAS benchmark. Experimental results reveal that existing ASC models suffer significant performance degradation when facing the event-shift challenge. The introduction of the ESAS dataset aims to drive future research toward event-robust ASC.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026. The ESAS dataset and source code are available at: https://github.com/bohanhu118/Interspeech2026_ESAS&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models</title>
    <author>
      <name>Seonuk Kim</name>
    </author>
    <author>
      <name>Yonghyeon Jun</name>
    </author>
    <author>
      <name>Ju Yeon Kang</name>
    </author>
    <author>
      <name>Jimin Hong</name>
    </author>
    <author>
      <name>Yoonhyeong Lee</name>
    </author>
    <author>
      <name>Nam Soo Kim</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06907v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06907v1" />
    <published>2026-06-05T04:50:34Z</published>
    <updated>2026-06-05T04:50:34Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Seonuk Kim et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through probing signal detectability analysis, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these challenges, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech, unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;5 pages, 5 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference</title>
    <author>
      <name>Kentaro Onda</name>
    </author>
    <author>
      <name>Satoru Fukayama</name>
    </author>
    <author>
      <name>Daisuke Saito</name>
    </author>
    <author>
      <name>Nobuaki Minematsu</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06806v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06806v1" />
    <published>2026-06-05T01:14:12Z</published>
    <updated>2026-06-05T01:14:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Kentaro Onda et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Discrete speech tokens obtained from self-supervised learning (SSL) models provide efficient data compression while maintaining strong performance, and have been widely used as intermediate representations in various tasks. However, discretization inevitably causes information loss, leading to degraded performance compared with continuous SSL features. In this work, we propose to apply soft token assignment only during downstream inference. This approach preserves the efficiency of hard discretization during training while enhancing the expressiveness of the tokens at inference. The proposed method outperforms conventional hard assignment on both ASR and speech synthesis tasks, and exhibits particularly strong generalizability to out-of-domain data. For ASR of non-native speech, it even surpasses models using continuous SSL features. Moreover, analysis of the resulting representations shows they align more accurately with phonemes compared with conventional hard assignment.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] BiEAR: A Human Auditory-Inspired Adaptive Binaural Front-end for Multi-Speaker Localisation and Distance Estimation</title>
    <author>
      <name>Hanyu Meng</name>
    </author>
    <author>
      <name>Eliathamby Ambikairajah</name>
    </author>
    <author>
      <name>Vidhyasaharan Sethu</name>
    </author>
    <author>
      <name>Qiquan Zhang</name>
    </author>
    <author>
      <name>Haizhou Li</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06795v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06795v1" />
    <published>2026-06-05T00:45:28Z</published>
    <updated>2026-06-05T00:45:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Hanyu Meng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present BiEAR, a human auditory-inspired adaptive binaural front-end for multi-speaker localisation and distance estimation. Inspired by medial olivocochlear (MOC) feedback in human hearing, BiEAR uses a neural controller to adaptively adjust the frequency selectivity of a binaural auditory filterbank during inference. This yields time-frequency adaptive representations for ears, enabling the model to respond to changing acoustic conditions. We evaluate BiEAR on multi-speaker localisation and distance estimation in anechoic and real-room environments. Results show that the adaptive front-end improves localisation accuracy and robustness to unseen speakers and rooms compared with commonly used fixed binaural front-ends. Visualisation and analysis of learned filter adaptations show that BiEAR emphasises informative frequency bands over time. These findings suggest that adaptive, biologically inspired binaural front-ends can improve machine hearing robustness in complex acoustic scenes.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to INTERSPEECH 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations</title>
    <author>
      <name>Naman Kothari</name>
    </author>
    <author>
      <name>Arjun Gangwar</name>
    </author>
    <author>
      <name>Adarsh Arigala</name>
    </author>
    <author>
      <name>S Umesh</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06740v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06740v1" />
    <published>2026-06-04T21:54:56Z</published>
    <updated>2026-06-04T21:54:56Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Naman Kothari et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes where units remain ambiguous. Our analysis shows similar phonemes across languages collapse to the same cluster IDs at smaller inventories, with larger clusters progressively separating them.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] FIGMA: Towards FIne-Grained Music retrievAl</title>
    <author>
      <name>Nishit Anand</name>
    </author>
    <author>
      <name>Ashish Seth</name>
    </author>
    <author>
      <name>Sreyan Ghosh</name>
    </author>
    <author>
      <name>Dinesh Manocha</name>
    </author>
    <author>
      <name>Ramani Duraiswami</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06615v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06615v1" />
    <published>2026-06-04T18:05:39Z</published>
    <updated>2026-06-04T18:05:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Nishit Anand et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Retrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively utilize only the first few tokens, discarding much of the information encoded in detailed prompts. We then propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct the Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to ACL 2026. Project Website: https://nishitanand.github.io/figma-website/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition</title>
    <author>
      <name>Fernando López</name>
    </author>
    <author>
      <name>Santosh Kesiraju</name>
    </author>
    <author>
      <name>Jordi Luque</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06211v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06211v1" />
    <published>2026-06-04T14:20:11Z</published>
    <updated>2026-06-04T14:20:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Fernando López et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted in Odyssey 2026: The Speaker and Language Recognition Workshop&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition</title>
    <author>
      <name>Jinyi Mi</name>
    </author>
    <author>
      <name>Ding Ma</name>
    </author>
    <author>
      <name>Tomoki Toda</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06200v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06200v1" />
    <published>2026-06-04T14:05:38Z</published>
    <updated>2026-06-04T14:05:38Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jinyi Mi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. The contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted to Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems</title>
    <author>
      <name>Tao Zhong</name>
    </author>
    <author>
      <name>Jiajun Deng</name>
    </author>
    <author>
      <name>Nikita Kuzmin</name>
    </author>
    <author>
      <name>Yinke Zhu</name>
    </author>
    <author>
      <name>Tianxiang Cao</name>
    </author>
    <author>
      <name>Tristan Tsoi</name>
    </author>
    <author>
      <name>Zhili Tan</name>
    </author>
    <author>
      <name>Simon Lui</name>
    </author>
    <author>
      <name>Xunying Liu</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06559v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06559v1" />
    <published>2026-06-04T12:39:44Z</published>
    <updated>2026-06-04T12:39:44Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Tao Zhong et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM&amp;#x27;s conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Geometric Second-Order Feature Correlation Learning for Self-Supervised Speech Emotion Recognition</title>
    <author>
      <name>Shuanglin Li</name>
    </author>
    <author>
      <name>Ruxiao Qian</name>
    </author>
    <author>
      <name>Siyang Song</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.06550v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.06550v1" />
    <published>2026-06-04T08:18:38Z</published>
    <updated>2026-06-04T08:18:38Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Shuanglin Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Self-supervised learning (SSL) yields powerful, context-rich representations for speech emotion recognition (SER), yet aggregating these representations into holistic descriptors remains a bottleneck. Conventional first-order aggregation implicitly assumes feature independence, which overlooks the latent Riemannian geometry and discards higher-order relationships essential to the representational power of the backbone. To address this problem, this paper proposes a novel Second-Order Correlation (SOC) layer. Instead of treating features in isolation, SOC models feature correlations as covariance descriptors to capture synergistic co-occurrence patterns, which serve as discriminative signatures for robust emotion recognition. By mapping these descriptors from the Riemannian manifold to a Euclidean tangent space through Log-Euclidean mapping (LEM), the proposed method preserves geometric integrity while enabling direct linear discriminative learning. Extensive experiments on the ESD and RAVDESS datasets demonstrate that SOC recovers discriminative information lost in first-order pooling and effectively aggregates high-dimensional SSL features.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition</title>
    <author>
      <name>Fei Su</name>
    </author>
    <author>
      <name>Cancan Li</name>
    </author>
    <author>
      <name>Ming Li</name>
    </author>
    <author>
      <name>Juan Liu</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05763v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05763v2" />
    <published>2026-06-04T06:44:54Z</published>
    <updated>2026-06-05T06:11:23Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Fei Su et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, while real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform fine-grained modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we release AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;submitted to IEEE Transactions on Audio, Speech, and Language Processing&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework</title>
    <author>
      <name>Weiguang Wang</name>
    </author>
    <author>
      <name>Fugen Wu</name>
    </author>
    <author>
      <name>Hailing Wang</name>
    </author>
    <author>
      <name>Xuechen Liang</name>
    </author>
    <author>
      <name>Xiaobin Li</name>
    </author>
    <author>
      <name>Ru Han</name>
    </author>
    <author>
      <name>Tianchang Xie</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05754v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05754v1" />
    <published>2026-06-04T06:29:25Z</published>
    <updated>2026-06-04T06:29:25Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Weiguang Wang et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Phase-sensitive optical time-domain reflectometry ($φ$-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced $φ$-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the $φ$-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro-F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for $φ$-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at https://github.com/wawa-abc/das.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Do speech foundation models perceive speaker similarity as humans do?</title>
    <author>
      <name>Minoru Kishi</name>
    </author>
    <author>
      <name>Hayato Yagi</name>
    </author>
    <author>
      <name>Shinnosuke Takamichi</name>
    </author>
    <author>
      <name>Yuki Saito</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05739v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05739v2" />
    <published>2026-06-04T06:04:18Z</published>
    <updated>2026-06-05T05:57:01Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Minoru Kishi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners can judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by INTERSPEECH 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Sound Effects Dataset Unification With the Universal Category System</title>
    <author>
      <name>Jun Woo Beck</name>
    </author>
    <author>
      <name>Alexander Lerch</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05571v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05571v1" />
    <published>2026-06-04T01:46:08Z</published>
    <updated>2026-06-04T01:46:08Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jun Woo Beck, Alexander Lerch&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Sound effects (SFX) datasets and libraries often employ distinct tagging schemes, taxonomies, and metadata structures. This creates challenges for research on SFX classification and generation because incompatible taxonomies lead to siloed datasets that might require individualized approaches, result in non-comparable outcomes, and prevent data merging strategies. We propose a modular dataset relabeling framework that adopts the Universal Category System (UCS), an industry-standard hierarchical taxonomy for sound effects, as a shared structural foundation. This open-source framework enables us (i) to convert tags of existing datasets to UCS with a rule-based multi-stage pipeline and conflict resolution to achieve high automatic conversion rates, (ii) to suggest a stratified dataset split for the new labels, and (iii) to combine multiple datasets. To showcase the practical utility, we introduce the EnvSound-UCS dataset, a publicly available unified UCS-compliant dataset of environmental sounds with 58,057 sound clips from three sources: AudioSet, FSD50K, and ESC-50.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;DAFx 2026 camera-ready version&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs</title>
    <author>
      <name>Huu Tuong Tu</name>
    </author>
    <author>
      <name>Hanh Nguyen</name>
    </author>
    <author>
      <name>Thien Van Luong</name>
    </author>
    <author>
      <name>Nguyen Tien Cuong</name>
    </author>
    <author>
      <name>Vu Huan</name>
    </author>
    <author>
      <name>Nguyen Thi Thu Trang</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05569v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05569v1" />
    <published>2026-06-04T01:38:11Z</published>
    <updated>2026-06-04T01:38:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Huu Tuong Tu et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. In this paper, we propose a method for constructing statistical graphs that enable models to learn phoneme confusion patterns represented as directed graphs. Furthermore, we introduce a language-specific strategy to capture systematic pronunciation differences across various native language (L1) backgrounds. The effectiveness of our approach is demonstrated through extensive experiments on the L2-ARCTIC benchmark, where it achieves an F1-score of 59.52%, outperforming several competitive baselines.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at Interspeech 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Exploring LLMs for South Asian Music Understanding and Generation</title>
    <author>
      <name>Faria Binte Kader</name>
    </author>
    <author>
      <name>Mohtasim Hadi Rafi</name>
    </author>
    <author>
      <name>Shah Wasif Sajjad</name>
    </author>
    <author>
      <name>Santu Karmaker</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05522v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05522v1" />
    <published>2026-06-03T23:53:27Z</published>
    <updated>2026-06-03T23:53:27Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Faria Binte Kader et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga, tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet -- representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, evaluating 33 LLMs where frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;19 pages, 7 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies</title>
    <author>
      <name>Abhinaba Roy</name>
    </author>
    <author>
      <name>Junyi Liang</name>
    </author>
    <author>
      <name>Dorien Herremans</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05394v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05394v1" />
    <published>2026-06-03T20:00:41Z</published>
    <updated>2026-06-03T20:00:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Abhinaba Roy et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;nnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for modern PyTorch and scientific Python. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale=&amp;#x27;no&amp;#x27;) and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech</title>
    <author>
      <name>Daniel Oliveira de Brito</name>
    </author>
    <author>
      <name>Arnaldo Candido Junior</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05367v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05367v1" />
    <published>2026-06-03T19:15:28Z</published>
    <updated>2026-06-03T19:15:28Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Daniel Oliveira de Brito, Arnaldo Candido Junior&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower operands on Qwen3-TTS-12Hz-1.7B - model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding (x-vector) produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone - we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $τ = \mathbb{E}_i[x(s_i,\text{emo})] - \mathbb{E}_i[x(s_i,\text{neutral})]$ applied to an unseen target speaker as $x_{\text{new}} = x(\text{target},\text{neutral}) + α \cdot τ$. Using ESD (English) as the $τ$ source and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains of $+0.29$ in emotion2vec cosine over the ICL baseline on English held-out speakers and $+0.09$ on Brazilian Portuguese held-out speakers, while largely preserving identity (WavLM SECS $\gtrsim 0.88$ for the multi-speaker $τ$ variant) and intelligibility (WER $\approx 0$ in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;10 pages, 5 figures&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Audio Interaction Model</title>
    <author>
      <name>Zhifei Xie</name>
    </author>
    <author>
      <name>Zihang Liu</name>
    </author>
    <author>
      <name>Ze An</name>
    </author>
    <author>
      <name>Xiaobin Hu</name>
    </author>
    <author>
      <name>Yue Liao</name>
    </author>
    <author>
      <name>Ziyang Ma</name>
    </author>
    <author>
      <name>Dongchao Yang</name>
    </author>
    <author>
      <name>Mingbao Lin</name>
    </author>
    <author>
      <name>Deheng Ye</name>
    </author>
    <author>
      <name>Shuicheng Yan</name>
    </author>
    <author>
      <name>Chunyan Miao</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.05121v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.05121v1" />
    <published>2026-06-03T17:26:11Z</published>
    <updated>2026-06-03T17:26:11Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Zhifei Xie et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Audio is an inherently interactive modality, yet today&amp;#x27;s Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Next generation of LALMs, work in progress&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] SURF: Separation via Unsupervised Remixing Flow</title>
    <author>
      <name>Henry Li</name>
    </author>
    <author>
      <name>Robin Scheibler</name>
    </author>
    <author>
      <name>Efthymios Tzinis</name>
    </author>
    <author>
      <name>Matt Shannon</name>
    </author>
    <author>
      <name>Arnaud Doucet</name>
    </author>
    <author>
      <name>John R. Hershey</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04921v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04921v1" />
    <published>2026-06-03T14:17:12Z</published>
    <updated>2026-06-03T14:17:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Henry Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The goal of single-channel source separation is to reconstruct $K$ sources given their mixture. In supervised settings where vast amounts of clean source data are available, this challenging, ill-posed problem has been addressed successfully by generative diffusion and flow-based prior models. However, access to such clean source samples is often limited, and even when available, supervised models are vulnerable to domain shifts. To bridge this gap, we present Separation via Unsupervised Remixing Flow (SURF), an unsupervised flow matching approach for source separation that learns directly from observed mixtures. This method relies on a novel combination of state-of-the-art supervised flow matching and regression-based self-supervised techniques. At a high level, starting from a teacher model, we utilize a &amp;quot;remixing&amp;quot; step to bootstrap the learning of a student flow model from the teacher&amp;#x27;s estimates. We provide insights into the objectives optimized by this approach and draw a novel connection to the Wake-Sleep algorithm. Empirical evaluations on image and audio benchmarks demonstrate that SURF establishes a new state-of-the-art, significantly outperforming existing unsupervised methods. See our demo page for examples: https://google.github.io/df-conformer/surf/&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted at ICML 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Gauss Circle Lattices with Geometric Convolutions for Synthesizing High Dimensional Image-Source Room Impulse Responses</title>
    <author>
      <name>Yuancheng Luo</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04358v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04358v1" />
    <published>2026-06-03T02:18:39Z</published>
    <updated>2026-06-03T02:18:39Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Yuancheng Luo&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The image-source model (ISM) is a widely adopted method for efficiently simulating acoustic room impulse responses (RIRs) under specular reflection assumptions. Acoustic paths between source and receiver are traced to lattice points computed from successive reflections over bounding planes of the room. Rectangular rooms bound the total number of image-sources to be polynomial in the RIR&amp;#x27;s duration or distance $k$ equivalent, with degree equal to the number of room dimensions $N$. Direct ISM simulations are therefore upper-bounded in compute by $O(k^N)$, and consider only cases of $N \leq 3$ for tractability and real-world applications. This work proposes an alternative computational method that lowers the asymptotic compute bound to $O(N k^2 \log k)$ for integer coordinates and room dimensions by reducing ISM lattice point counting to the classic Gauss circle problem (GCP). We extend the lattice counting model to frequency-dependent and reflection-weighted image-sources in higher dimensions, relating solutions between successive dimensions via the convolution operator. Two constructions for realizing RIRs are presented, along with time-frequency controls, error and run-time analysis, and RIR statistics.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted for publication at the 29th International Conference on Digital Audio Effects 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Feasibility of Time-Domain DNN-Based Speech Enhancement on Embedded FPGA for Hearing Aid</title>
    <author>
      <name>Feyisayo Olalere</name>
    </author>
    <author>
      <name>Umut Altin</name>
    </author>
    <author>
      <name>Kiki van der Heijden</name>
    </author>
    <author>
      <name>Marcel van Gerven</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04221v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04221v1" />
    <published>2026-06-02T21:17:00Z</published>
    <updated>2026-06-02T21:17:00Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Feyisayo Olalere et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Hearing aids impose strict latency and power constraints that current DNN-based speech enhancement systems struggle to meet on embedded hardware. We characterize this gap by deploying both speech separation and denoising using the lightweight SuDoRM-RF++ architecture on the AMD-Xilinx Kria KV260, evaluated at FP32 and 16-bit fixed-point precision for each task. Across these configurations, first-sample latency tracks with on-chip parameter caching rather than arithmetic throughput, identifying data movement as the primary bottleneck. Precision reduction halves the model memory footprint without compromising objective speech quality. The fixed-point denoising accelerator achieves a first-sample latency of 9.7 ms, meeting the 10 ms clinical threshold, while speech separation reaches 16.0 ms. These measurements establish concrete resource requirements for embedded DNN-based speech enhancement and quantify the remaining gap to hearing aid deployment.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;13 pages&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.MM] DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities</title>
    <author>
      <name>Sajad Ebrahimi</name>
    </author>
    <author>
      <name>Nima Jamali</name>
    </author>
    <author>
      <name>Bardia Shirsalimian</name>
    </author>
    <author>
      <name>Kelly McConvey</name>
    </author>
    <author>
      <name>Wentao Zhang</name>
    </author>
    <author>
      <name>Jalehsadat Mahdavimoghaddam</name>
    </author>
    <author>
      <name>Maksym Taranukhin</name>
    </author>
    <author>
      <name>Maura Grossman</name>
    </author>
    <author>
      <name>Vered Shwartz</name>
    </author>
    <author>
      <name>Yuntian Deng</name>
    </author>
    <author>
      <name>Ebrahim Bagheri</name>
    </author>
    <category term="cs.MM" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04205v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04205v1" />
    <published>2026-06-02T20:49:20Z</published>
    <updated>2026-06-02T20:49:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Sajad Ebrahimi et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.CL] Efficient ASR Training with Conversations that Never Happened</title>
    <author>
      <name>Máté Gedeon</name>
    </author>
    <author>
      <name>Péter Mihajlik</name>
    </author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03957v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03957v1" />
    <published>2026-06-02T17:46:12Z</published>
    <updated>2026-06-02T17:46:12Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Máté Gedeon, Péter Mihajlik&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] LiveBand: Live Accompaniment Generation in the Audio Domain</title>
    <author>
      <name>Marco Pasini</name>
    </author>
    <author>
      <name>Javier Nistal</name>
    </author>
    <author>
      <name>Ben Hayes</name>
    </author>
    <author>
      <name>Mathias Rose Bjare</name>
    </author>
    <author>
      <name>Stefan Lattner</name>
    </author>
    <author>
      <name>George Fazekas</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03803v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03803v2" />
    <published>2026-06-02T15:50:13Z</published>
    <updated>2026-06-09T17:04:13Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Marco Pasini et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model&amp;#x27;s training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation</title>
    <author>
      <name>Ye Tao</name>
    </author>
    <author>
      <name>Lupeng Liu</name>
    </author>
    <author>
      <name>Xuenan Xu</name>
    </author>
    <author>
      <name>Jiasun Feng</name>
    </author>
    <author>
      <name>Jiarui Wang</name>
    </author>
    <author>
      <name>Ying Qin</name>
    </author>
    <author>
      <name>Shuiyang Mao</name>
    </author>
    <author>
      <name>Wei Liu</name>
    </author>
    <author>
      <name>Shuai Wang</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03672v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03672v1" />
    <published>2026-06-02T13:56:31Z</published>
    <updated>2026-06-02T13:56:31Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Ye Tao et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task-level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST-Bench, a benchmark for holistic video soundtrack generation evaluation. Experiments show that Foley-Omni achieves competitive performance with expert systems on individual synthesis tasks, while improving speech intelligibility, audiovisual consistency and perceptual quality for mixed soundtrack generation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling</title>
    <author>
      <name>Wenxi Chen</name>
    </author>
    <author>
      <name>Dongya Jia</name>
    </author>
    <author>
      <name>Yushen Chen</name>
    </author>
    <author>
      <name>Zhikang Niu</name>
    </author>
    <author>
      <name>Yuzhe Liang</name>
    </author>
    <author>
      <name>Xiquan Li</name>
    </author>
    <author>
      <name>Ruiqi Yan</name>
    </author>
    <author>
      <name>Ziyang Ma</name>
    </author>
    <author>
      <name>Guanrou Yang</name>
    </author>
    <author>
      <name>Sanyuan Chen</name>
    </author>
    <author>
      <name>Yue Wang</name>
    </author>
    <author>
      <name>Zhuo Chen</name>
    </author>
    <author>
      <name>Kai Yu</name>
    </author>
    <author>
      <name>Xie Chen</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03455v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03455v1" />
    <published>2026-06-02T10:33:20Z</published>
    <updated>2026-06-02T10:33:20Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Wenxi Chen et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long sequence length of audio signals. To overcome this, we propose WavTTS, the first raw waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built upon flow matching with a Diffusion Transformer (DiT), WavTTS directly models speech waveforms via a simple patchification strategy, while integrating multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design to improve generation quality. Evaluations on open-source benchmarks demonstrate that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection</title>
    <author>
      <name>Daniil Krasnoproshin</name>
    </author>
    <author>
      <name>Maxim Vashkevich</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03359v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03359v1" />
    <published>2026-06-02T09:08:59Z</published>
    <updated>2026-06-02T09:08:59Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Daniil Krasnoproshin, Maxim Vashkevich&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Speech emotion recognition is an important component of modern human-computer interaction systems. However, many state-of-the-art approaches rely on large pretrained models with high computational and memory requirements, limiting their applicability. This paper proposes ResLSTM-SA, a lightweight architecture that integrates residual connections with soft attention within an LSTM-based framework. Evaluated on the RAVDESS dataset under strict speaker-independent partitioning, the proposed model outperforms conventional attention-based LSTM baselines and several previously reported CNN- and hybrid CNN-LSTM architectures in terms of unweighted average recall (UAR). The best-performing variant (ResLSTM-SA-h64) achieves a maximum UAR of 0.6517 with only 46.8k trainable parameters, delivering competitive accuracy with three orders of magnitude fewer parameters than large-scale self-supervised alternatives, thereby enabling efficient deployment on edge devices and real-time voice assistants. The source code is available at https://github.com/Mak-Sim/ResLSTM-SER.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;6 pages, 5 figures, DSPA 2026&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] SpeakerCard-1M: An Evidence-Grounded Speaker Card Corpus for In-the-Wild Speaker Verification</title>
    <author>
      <name>Junyi Peng</name>
    </author>
    <author>
      <name>Oldřich Plchot</name>
    </author>
    <author>
      <name>Xiao Song</name>
    </author>
    <author>
      <name>Dading Chong</name>
    </author>
    <author>
      <name>Lichun Fan</name>
    </author>
    <author>
      <name>Hang Su</name>
    </author>
    <author>
      <name>Themos Stafylakis</name>
    </author>
    <author>
      <name>Junjie Li</name>
    </author>
    <author>
      <name>Kong Aik Lee</name>
    </author>
    <author>
      <name>Shuai Wang</name>
    </author>
    <author>
      <name>Jian Luan</name>
    </author>
    <author>
      <name>Jan Černocký</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03283v2</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03283v2" />
    <published>2026-06-02T07:49:30Z</published>
    <updated>2026-06-03T09:14:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Junyi Peng et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Modern speaker verification (SV) systems rely on speaker embeddings that are effective but difficult to interpret or query in natural language. Most existing speech-text corpora target controllable synthesis or utterance-level captioning, and provide limited speaker-level supervision for in-the-wild speaker recognition. This paper introduces SpeakerCard-1M, a bilingual speaker-centric resource for evidence-grounded SV, derived from VoxCeleb1/2 and CN-Celeb1/2, where the &amp;quot;-1M&amp;quot; suffix refers to the 1.78M utterance-level captions contained in the release. We adopt a tool-first, LLM-last approach: ten acoustic probes produce field-level evidence, the evidence is aggregated into speaker profiles under a schema that separates relatively stable traits from utterance-level states, and bilingual Speaker Cards are rendered by a constrained LLM that sees only the structured fields. The release includes 56.7K Speaker Card records over 10.2K speakers, 1.78M utterance-level captions, and speaker-ID-disjoint hard-negative triplets. We further define two SV-oriented cross-modal protocols, bidirectional Speaker-Text Retrieval (T2S-R / S2T-R) and Attribute-Conditioned Verification (AC-Verify), and compare a dual-encoder baseline against recent audio language models under a zero-shot forced-choice setting. Joint audio-text training increases VoxCeleb1-O EER by 0.31% absolute over the audio-only baseline. Under a style-symmetric LLM-generated counterfactual protocol, eight recent audio language models (7B-30B+ parameters, both open- and closed-source) score 49-77% on pitch-level AC-Verify under two-way forced choice, compared with 88.66% reached by our dual encoder.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Corpus and protocols at https://junyipeng00.github.io/SpeakerCard-1M-page&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.MM] Inference-Time Scaling for Joint Audio-Video Generation</title>
    <author>
      <name>Jaemin Jung</name>
    </author>
    <author>
      <name>Kyeongha Rho</name>
    </author>
    <author>
      <name>Inkyu Shin</name>
    </author>
    <author>
      <name>Joon Son Chung</name>
    </author>
    <category term="cs.MM" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03183v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03183v1" />
    <published>2026-06-02T05:41:41Z</published>
    <updated>2026-06-02T05:41:41Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jaemin Jung et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;Accepted by Transactions on Machine Learning Research (TMLR). Project page: https://jung-jaemin.github.io/ITS-AVGen-Proj/&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[cs.SD] Channel-Oriented Design for EEG-to-Music Reconstruction</title>
    <author>
      <name>Jiaxin Qing</name>
    </author>
    <author>
      <name>Junwei Lu</name>
    </author>
    <author>
      <name>Lexin Li</name>
    </author>
    <category term="cs.SD" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.04040v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.04040v1" />
    <published>2026-06-02T04:13:37Z</published>
    <updated>2026-06-02T04:13:37Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Jiaxin Qing et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;Brain-computer interfaces aim to decode naturalistic stimuli from neural signals, yet most progress to date has focused on vision and language. In this article, we study a more challenging but far less explored setting, EEG-to-music reconstruction, where signals are weak, distributed, and highly susceptible to noise and channel variability. Our central finding is that early channel mixing destroys weak but discriminative EEG signals. To address this, we propose a channel-oriented design with three key components. Specifically, channel-wise tokenization treats each electrode as an explicit token to retain spatially localized neural evidence, channel-wise multi-view self-distillation enforces consistency across temporal crops and random channel subsets to learn robust and distributed representations, and channel-wise data augmentation introduces structured channel dropout to improve invariance to noise, artifacts, and missing electrodes. Together, these components preserve weak yet informative signals across channels and enable stable alignment to a semantic music representation space. We integrate this channel-oriented design within an encoding-alignment-decoding pipeline for EEG-to-music reconstruction. Theoretically, we characterize when preserving channel-level structure leads to improved alignment. Empirically, we compare with a range of state-of-the-art baselines and demonstrate consistent and significant performance gains.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>[eess.AS] AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following</title>
    <author>
      <name>Haitao Li</name>
    </author>
    <author>
      <name>Tian Tan</name>
    </author>
    <author>
      <name>Yuguang Yang</name>
    </author>
    <author>
      <name>Shan Yang</name>
    </author>
    <author>
      <name>Xie Chen</name>
    </author>
    <category term="eess.AS" scheme="http://arxiv.org/schemas/atom" />
    <id>https://arxiv.org/abs/2606.03116v1</id>
    <link rel="alternate" type="text/html" href="https://arxiv.org/abs/2606.03116v1" />
    <published>2026-06-02T04:00:32Z</published>
    <updated>2026-06-02T04:00:32Z</updated>
    <content type="html">&lt;h3&gt;Authors:&lt;/h3&gt;
&lt;p&gt;Haitao Li et al.&lt;/p&gt;
&lt;h3&gt;Abstract:&lt;/h3&gt;
&lt;p&gt;The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. By employing a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly enhances zero-shot alignment detection compared to state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.&lt;/p&gt;
&lt;h3&gt;Comments:&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;</content>
  </entry>
  <code-available-feed:processed>
    <code-available-feed:article url="https://arxiv.org/abs/2605.31530v2" updated="2026-06-02T14:24:03Z" repo_found_in="pdf" repo_urls="https://huggingface.co/alefiury/ https://huggingface.co/alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech https://lizhaoqing.github.io/UNISON-demo/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.01802v2" updated="2026-06-02T08:35:57Z" repo_found_in="pdf" repo_urls="https://huggingface.co/collections/OpenMOSS-Team/moss-audio https://openmoss.github.io/MOSS-Audio/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.01802v3" updated="2026-06-05T13:33:35Z" repo_found_in="pdf" repo_urls="https://huggingface.co/collections/OpenMOSS-Team/moss-audio https://openmoss.github.io/MOSS-Audio/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.01804v2" updated="2026-06-03T03:45:51Z" repo_found_in="abstract" repo_urls="https://github.com/daxintan-cuhk/SpeechEditBench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.02341v2" updated="2026-06-05T18:03:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.02980v1" updated="2026-06-02T00:38:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03028v1" updated="2026-06-02T02:07:29Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03116v1" updated="2026-06-02T04:00:32Z" repo_found_in="pdf" repo_urls="https://github.com/CuCl-2/AnyAudio-Judge" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03169v1" updated="2026-06-02T05:27:56Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03183v1" updated="2026-06-02T05:41:41Z" repo_found_in="comment" repo_urls="https://jung-jaemin.github.io/ITS-AVGen-Proj/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03283v2" updated="2026-06-03T09:14:41Z" repo_found_in="comment" repo_urls="https://junyipeng00.github.io/SpeakerCard-1M-page" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03359v1" updated="2026-06-02T09:08:59Z" repo_found_in="abstract" repo_urls="https://github.com/Mak-Sim/ResLSTM-SER" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03455v1" updated="2026-06-02T10:33:20Z" repo_found_in="pdf" repo_urls="https://github.com/cwx-worst-one/WavTTS https://wavtts.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03459v1" updated="2026-06-02T10:36:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03672v1" updated="2026-06-02T13:56:31Z" repo_found_in="pdf" repo_urls="https://ty0402.github.io/Foley-omni-Web/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03803v1" updated="2026-06-02T15:50:13Z" repo_found_in="pdf" repo_urls="https://sonycslparis.github.io/liveband-companion" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03803v2" updated="2026-06-09T17:04:13Z" repo_found_in="pdf" repo_urls="https://sonycslparis.github.io/liveband-companion" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.03957v1" updated="2026-06-02T17:46:12Z" repo_found_in="pdf" repo_urls="https://huggingface.co/coqui/XTTS-v2 https://huggingface.co/nvidia/stt_en_ https://huggingface.co/nvidia/stt_en_fastconformer_ctc_large https://huggingface.co/openai/ https://huggingface.co/openai/whisper-large-v3" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04040v1" updated="2026-06-02T04:13:37Z" repo_found_in="pdf" repo_urls="https://github.com/jqin4749/EEG-to-Music" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04103v1" updated="2026-06-02T18:09:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04205v1" updated="2026-06-02T20:49:20Z" repo_found_in="abstract" repo_urls="https://github.com/sadjadeb/DetectZoo" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04210v1" updated="2026-06-02T20:56:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04221v1" updated="2026-06-02T21:17:00Z" repo_found_in="pdf" repo_urls="https://github.com/umutcanaltin/audio_task/tree/main" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04358v1" updated="2026-06-03T02:18:39Z" repo_found_in="pdf" repo_urls="https://github.com/yluo1/GCP-ISM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04370v1" updated="2026-06-03T02:34:45Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04418v1" updated="2026-06-03T03:56:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04475v1" updated="2026-06-03T05:44:27Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04570v1" updated="2026-06-03T08:03:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04584v1" updated="2026-06-03T08:20:15Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04680v1" updated="2026-06-03T10:03:19Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04844v1" updated="2026-06-03T13:12:34Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.04921v1" updated="2026-06-03T14:17:12Z" repo_found_in="abstract" repo_urls="https://google.github.io/df-conformer/surf/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05101v1" updated="2026-06-03T17:04:26Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05121v1" updated="2026-06-03T17:26:11Z" repo_found_in="pdf" repo_urls="https://github.com/xzf-thu/Audio-Interaction https://huggingface.co/datasets/zhifeixie/StreamAudio-2M https://xzf-thu.github.io/Audio-Interaction" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05161v1" updated="2026-06-03T17:57:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05367v1" updated="2026-06-03T19:15:28Z" repo_found_in="pdf" repo_urls="https://github.com/danielbrito91/xvector-emotion-arithmetic" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05394v1" updated="2026-06-03T20:00:41Z" repo_found_in="pdf" repo_urls="https://github.com/AMAAI-Lab/ https://github.com/AMAAI-Lab/nnAudio2" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05522v1" updated="2026-06-03T23:53:27Z" repo_found_in="pdf" repo_urls="https://github.com/Faria-Binte-Kader/ https://github.com/Faria-Binte-Kader/South-Asian-Music-data" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05544v1" updated="2026-06-04T00:58:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05569v1" updated="2026-06-04T01:38:11Z" repo_found_in="pdf" repo_urls="https://huggingface.co/facebook/ https://huggingface.co/facebook/wav2vec2-large-xlsr-53" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05571v1" updated="2026-06-04T01:46:08Z" repo_found_in="pdf" repo_urls="https://github.com/JunWooBeck/audioset-ucs https://github.com/JunWooBeck/envsound-ucs https://github.com/JunWooBeck/esc50-ucs https://github.com/JunWooBeck/fsd50k-ucs https://github.com/JunWooBeck/ucs-sfx-tools" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05575v1" updated="2026-06-04T01:50:12Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05678v1" updated="2026-06-04T04:00:48Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05713v1" updated="2026-06-04T05:12:36Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05739v1" updated="2026-06-04T06:04:18Z" repo_found_in="pdf" repo_urls="https://github.com/0nutation/SpeechGPT https://github.com/Audio-WestlakeU/audiossl/ https://github.com/Audio-WestlakeU/audiossl/blob/main/audiossl/methods/ATST-Frame/README.md https://github.com/Plachtaa/VALL-E-X https://github.com/QwenLM/Qwen3-TTS https://github.com/YuanGongND/ast https://github.com/microsoft/SpeechT5 https://github.com/openai/whisper https://github.com/theolepage/ https://github.com/theolepage/wavlm_ssl_sv/blob/main/README.md https://huggingface.co/docs/transformers/ https://huggingface.co/docs/transformers/model_doc/wavlm https://huggingface.co/facebook/models https://huggingface.co/microsoft/speecht5_tts https://huggingface.co/nvidia/models" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05739v2" updated="2026-06-05T05:57:01Z" repo_found_in="pdf" repo_urls="https://github.com/0nutation/SpeechGPT https://github.com/Audio-WestlakeU/audiossl/ https://github.com/Audio-WestlakeU/audiossl/blob/main/audiossl/methods/ATST-Frame/README.md https://github.com/Plachtaa/VALL-E-X https://github.com/QwenLM/Qwen3-TTS https://github.com/YuanGongND/ast https://github.com/microsoft/SpeechT5 https://github.com/openai/whisper https://github.com/theolepage/ https://github.com/theolepage/wavlm_ssl_sv/blob/main/README.md https://huggingface.co/docs/transformers/ https://huggingface.co/docs/transformers/model_doc/wavlm https://huggingface.co/facebook/models https://huggingface.co/microsoft/speecht5_tts https://huggingface.co/nvidia/models" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05754v1" updated="2026-06-04T06:29:25Z" repo_found_in="abstract" repo_urls="https://github.com/wawa-abc/das" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05763v1" updated="2026-06-04T06:44:54Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/SMIIP-lab/AISHELL8-RealScene" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05763v2" updated="2026-06-05T06:11:23Z" repo_found_in="pdf" repo_urls="https://huggingface.co/datasets/SMIIP-lab/AISHELL8-RealScene" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05852v1" updated="2026-06-04T08:27:17Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05889v1" updated="2026-06-04T08:58:57Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05909v1" updated="2026-06-04T09:14:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.05911v1" updated="2026-06-04T09:16:26Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06037v1" updated="2026-06-04T11:31:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06037v2" updated="2026-06-08T08:49:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06065v1" updated="2026-06-04T12:07:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06065v2" updated="2026-06-05T03:19:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06200v1" updated="2026-06-04T14:05:38Z" repo_found_in="pdf" repo_urls="https://huggingface.co/TencentGameMate/chinese-wav2vec2-base https://huggingface.co/facebook/wav2vec2-base-960h https://huggingface.co/facebook/wav2vec2-base-de-voxpopuli-v2 https://huggingface.co/facebook/wav2vec2-base-fr-voxpopuli" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06211v1" updated="2026-06-04T14:20:11Z" repo_found_in="pdf" repo_urls="https://github.com/ferugit/film-spk-asr https://github.com/wenet-e2e/wespeaker" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06357v1" updated="2026-06-04T16:25:07Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06444v1" updated="2026-06-04T17:42:05Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06550v1" updated="2026-06-04T08:18:38Z" repo_found_in="pdf" repo_urls="https://github.com/secret-code-source/SOC" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06559v1" updated="2026-06-04T12:39:44Z" repo_found_in="pdf" repo_urls="https://github.com/snakers4/silero-vad https://huggingface.co/datasets/ICTNLP/Instr https://huggingface.co/datasets/ICTNLP/InstructS2S-200K https://microsoft.github.io/msmarco/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06615v1" updated="2026-06-04T18:05:39Z" repo_found_in="comment" repo_urls="https://nishitanand.github.io/figma-website/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06740v1" updated="2026-06-04T21:54:56Z" repo_found_in="pdf" repo_urls="https://github.com/AI4Bharat/IndicConformerASR https://github.com/AI4Bharat/IndicMFA" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06743v1" updated="2026-06-04T21:57:18Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06795v1" updated="2026-06-05T00:45:28Z" repo_found_in="pdf" repo_urls="https://github.com/Hanyu-Meng/BiEAR" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06806v1" updated="2026-06-05T01:14:12Z" repo_found_in="pdf" repo_urls="https://github.com/interactiveaudiolab/ppgs https://ondatk68.github.io/onda-demo/ https://ondatk68.github.io/onda-demo/projects/soft-token-inference/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06907v1" updated="2026-06-05T04:50:34Z" repo_found_in="pdf" repo_urls="https://sakshi113.github.io/mmau_homepage/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06921v1" updated="2026-06-05T05:35:03Z" repo_found_in="comment" repo_urls="https://github.com/bohanhu118/Interspeech2026_ESAS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06928v1" updated="2026-06-05T05:43:15Z" repo_found_in="comment" repo_urls="https://github.com/OpenBMB/VoxCPM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06940v1" updated="2026-06-05T06:11:38Z" repo_found_in="abstract" repo_urls="https://github.com/zxzhao0/CogAudio-LLM" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.06975v1" updated="2026-06-05T07:07:22Z" repo_found_in="pdf" repo_urls="https://github.com/mun3im/MyGardenBird" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07015v1" updated="2026-06-05T07:59:17Z" repo_found_in="pdf" repo_urls="https://github.com/RVC-Boss/GPT-SoVITS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07030v1" updated="2026-06-05T08:19:51Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07080v1" updated="2026-06-05T09:19:24Z" repo_found_in="pdf" repo_urls="https://github.com/rednote-hilab/dots.tts https://huggingface.co/collections/rednote-hilab/dotstts https://rednote-hilab.github.io/dots.tts-demo" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07207v1" updated="2026-06-05T12:19:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07210v1" updated="2026-06-05T12:21:25Z" repo_found_in="pdf" repo_urls="https://github.com/OraneD/Speaker-Linkability https://github.com/V https://github.com/deep-privacy/sidekit https://github.com/kiwano-toolkit/kiwano/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07229v1" updated="2026-06-05T12:52:41Z" repo_found_in="comment" repo_urls="https://github.com/ddlBoJack/MMAE" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07240v1" updated="2026-06-05T13:09:21Z" repo_found_in="pdf" repo_urls="https://github.com/fishaudio/fish-speech https://huggingface.co/datasets/ymoslem/acl-6060 https://huggingface.co/facebook/mms-1b-all https://huggingface.co/microsoft/VibeVoice-ASR https://huggingface.co/microsoft/wavlm-base-plus-sv https://huggingface.co/openai/whisper-large-v3 https://huggingface.co/speechbrain/lang-id-voxlingua107-ecapa https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07259v1" updated="2026-06-05T13:35:10Z" repo_found_in="pdf" repo_urls="https://github.com/chaufanglin/mv2lrs3 https://github.com/yakhyo/uniface" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07271v1" updated="2026-06-05T13:46:37Z" repo_found_in="pdf" repo_urls="https://github.com/sourisimos/rectified-flow-membership" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07293v1" updated="2026-06-05T14:05:19Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07309v1" updated="2026-06-05T14:26:06Z" repo_found_in="pdf" repo_urls="https://huggingface.co/mispeech/midashenglm-7b-0804-fp32 https://huggingface.co/nvidia/audio-flamingo-3# https://huggingface.co/nvidia/audio-flamingo-3#think-mode-reasoning-with-peft-adapter-af-think" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07334v1" updated="2026-06-05T14:49:24Z" repo_found_in="pdf" repo_urls="https://huggingface.co/PearlLeeStudio" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07356v1" updated="2026-06-05T15:04:22Z" repo_found_in="pdf" repo_urls="https://directaudioedit.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07397v1" updated="2026-06-05T15:38:08Z" repo_found_in="abstract" repo_urls="https://audiooscar.github.io/ https://github.com/ziye26/Audio-Oscar" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07473v1" updated="2026-06-05T17:26:23Z" repo_found_in="pdf" repo_urls="https://github.com/audiosae/audio-sae https://huggingface.co/Egorgij21/ https://huggingface.co/Egorgij21/Audio-SAE-Whisper-large-v3 https://huggingface.co/Egorgij21/Audio-SAE-Whisper-small" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07494v1" updated="2026-06-05T17:48:46Z" repo_found_in="pdf" repo_urls="https://huggingface.co/nii-yamagishilab/xls-r-2b-anti-deepfake" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.07673v1" updated="2026-06-04T17:39:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08038v1" updated="2026-06-06T07:58:02Z" repo_found_in="pdf" repo_urls="https://huggingface.co/facebook/wav2vec2-xls-r-300m" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08078v1" updated="2026-06-06T09:55:37Z" repo_found_in="pdf" repo_urls="https://github.com/kiwano-toolkit/kiwano" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08087v1" updated="2026-06-06T10:23:18Z" repo_found_in="pdf" repo_urls="https://github.com/kiwano-toolkit/kiwano" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08210v1" updated="2026-06-06T14:54:44Z" repo_found_in="pdf" repo_urls="https://huggingface.co/facebook/wav2vec2-base-960h" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08286v1" updated="2026-06-06T18:14:41Z" repo_found_in="comment" repo_urls="https://anniejchu.github.io/fxplorer/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08385v1" updated="2026-06-07T00:44:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08425v1" updated="2026-06-07T02:50:24Z" repo_found_in="comment" repo_urls="https://interspeech-tinygiant-alm.vercel.app" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08505v1" updated="2026-06-07T08:10:14Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08580v1" updated="2026-06-07T11:28:32Z" repo_found_in="pdf" repo_urls="https://github.com/Hello3orld/G-MaP-SE" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08663v1" updated="2026-06-07T15:08:19Z" repo_found_in="abstract" repo_urls="https://github.com/MAAP-LAB/CoMoE" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08669v1" updated="2026-06-07T15:20:38Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08678v1" updated="2026-06-07T15:31:00Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08722v1" updated="2026-06-07T16:32:59Z" repo_found_in="abstract" repo_urls="https://github.com/CSCPadova/lilybench" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.08843v1" updated="2026-06-07T21:25:14Z" repo_found_in="abstract" repo_urls="https://palindromic-vc.github.io" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09019v1" updated="2026-06-08T04:32:08Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09048v1" updated="2026-06-08T05:36:42Z" repo_found_in="abstract" repo_urls="https://barewave.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09050v1" updated="2026-06-08T05:39:23Z" repo_found_in="pdf" repo_urls="https://aslp-lab.github.io/MeanVC2/ https://github.com/BytedanceSpeech/seed-tts-eval https://github.com/microsoft/DNS-Challenge https://huggingface.co/funasr/paraformer-zh" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09141v1" updated="2026-06-08T07:39:26Z" repo_found_in="pdf" repo_urls="https://aslp-lab.github.io/flashtts_demo https://github.com/ASLP-lab/FlashTTS https://github.com/BytedanceSpeech/seed-tts-eval https://github.com/modelscope/FunASR https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09141v2" updated="2026-06-09T03:52:24Z" repo_found_in="pdf" repo_urls="https://aslp-lab.github.io/flashtts_demo https://github.com/ASLP-lab/FlashTTS https://github.com/BytedanceSpeech/seed-tts-eval https://github.com/modelscope/FunASR https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09234v1" updated="2026-06-08T09:07:23Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09266v1" updated="2026-06-08T09:37:44Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09271v1" updated="2026-06-08T09:39:33Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09535v1" updated="2026-06-08T14:18:51Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/anoopkunchukuttan/indic_nlp_library" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09553v1" updated="2026-06-08T14:30:48Z" repo_found_in="pdf" repo_urls="https://github.com/McGill-NLP/open-bible-resources https://github.com/McGill-NLP/open-bible-tts https://github.com/jitsi/jiwer https://huggingface.co/multilingual-tts" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09667v1" updated="2026-06-08T15:50:51Z" repo_found_in="pdf" repo_urls="https://github.com/hitz-zentroa/ahoNT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09717v1" updated="2026-06-08T16:43:37Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Qwen/ https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09780v1" updated="2026-06-08T17:40:09Z" repo_found_in="comment" repo_urls="https://doi.org/10.1007/978-3-031-56992-0_14" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09925v1" updated="2026-06-07T12:24:18Z" repo_found_in="pdf" repo_urls="https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct https://huggingface.co/Qwen/Qwen2.5-Omni-3B https://huggingface.co/Qwen/Qwen2.5-Omni-7B https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Thinking https://huggingface.co/google/gemma-3n-E2B-it https://huggingface.co/google/gemma-3n-E4B-it https://huggingface.co/google/gemma-4-E2B-it https://huggingface.co/google/gemma-4-E4B-it https://huggingface.co/microsoft/Phi-4-multimodal-instruct https://huggingface.co/stepfun-ai/Step-Audio-R1" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09962v1" updated="2026-06-08T14:41:24Z" repo_found_in="pdf" repo_urls="https://github.com/FunAudioLLM/CosyVoice https://github.com/SWivid/F5-TTS https://github.com/li1jkdaw/ https://github.com/li1jkdaw/CDCD-TTS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.09966v1" updated="2026-06-08T16:29:59Z" repo_found_in="pdf" repo_urls="https://github.com/AIoT-MLSys-Lab/RespiraMFM https://huggingface.co/microsoft/phi-2 https://respiramfm.github.io/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10010v1" updated="2026-06-08T18:01:20Z" repo_found_in="pdf" repo_urls="https://github.com/JethroWangSir/DeRA-MOS" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10046v1" updated="2026-06-08T18:18:28Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10147v1" updated="2026-06-08T20:26:09Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10213v1" updated="2026-06-08T22:07:59Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10223v1" updated="2026-06-08T22:22:48Z" repo_found_in="pdf" repo_urls="https://github.com/ https://github.com/piotrkawa/ https://github.com/piotrkawa/audio-deepfake-source-tracing" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10231v1" updated="2026-06-08T22:44:04Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10233v1" updated="2026-06-08T22:46:30Z" repo_found_in="pdf" repo_urls="https://huggingface.co/espnet/arecho_scale_ https://huggingface.co/espnet/arecho_scale_v0.1-large-decoder" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10246v1" updated="2026-06-08T23:26:39Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10278v1" updated="2026-06-09T00:59:43Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10317v1" updated="2026-06-09T02:14:11Z" repo_found_in="pdf" repo_urls="https://github.com/bshall/knn-vc https://github.com/openai/whisper https://github.com/tomoya-san/ssl-gmmvc https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10360v1" updated="2026-06-09T03:21:40Z" repo_found_in="abstract" repo_urls="https://github.com/khanld/chunkformer" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10365v1" updated="2026-06-09T03:24:24Z" repo_found_in="pdf" repo_urls="https://github.com/Kyubyong/ https://github.com/Kyubyong/g2p https://github.com/gusrud1103/LibriPhrase https://github.com/gusrud1103/LibriPhrase.git" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10368v1" updated="2026-06-09T03:27:30Z" repo_found_in="abstract" repo_urls="https://github.com/Sslnon/ELF-S2T" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10407v1" updated="2026-06-09T04:31:30Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10439v1" updated="2026-06-09T05:35:31Z" repo_found_in="pdf" repo_urls="https://github.com/mubingshen/MLC-SLM-Baseline" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10454v1" updated="2026-06-09T06:02:31Z" repo_found_in="pdf" repo_urls="https://huggingface.co/nvidia/ https://huggingface.co/nvidia/canary-qwen-2.5b https://huggingface.co/spaces/hf-audio/open https://huggingface.co/spaces/hf-audio/open_asr_leaderboard" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10565v1" updated="2026-06-09T08:29:16Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10581v1" updated="2026-06-09T08:45:52Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10591v1" updated="2026-06-09T08:55:47Z" repo_found_in="pdf" repo_urls="https://github.com/timmahrt/praatIO" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10627v1" updated="2026-06-09T09:28:46Z" repo_found_in="" repo_urls="" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10791v1" updated="2026-06-09T12:42:14Z" repo_found_in="pdf" repo_urls="https://xuepingzhang.github.io/CompSpoof-V2-Dataset/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10908v1" updated="2026-06-09T14:20:05Z" repo_found_in="pdf" repo_urls="https://github.com/Security-FIT/RAT" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10911v1" updated="2026-06-09T14:20:55Z" repo_found_in="pdf" repo_urls="https://security-fit.github.io/deepfake_speech_ https://security-fit.github.io/deepfake_speech_datasets_app/" />
    <code-available-feed:article url="https://arxiv.org/abs/2606.10912v1" updated="2026-06-09T14:21:45Z" repo_found_in="pdf" repo_urls="https://github.com/Security-FIT/IG_for_SSL_detectors" />
  </code-available-feed:processed>
</feed>