Vitaliy Liptchinsky

Mountain View, California, United States
1K followers · 500+ connections


Experience & Education

  • General Motors


Publications

  • Self-supervised pretraining of visual features in the wild

    Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a controlled environment, that is, the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore whether self-supervision lives up to its expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs, achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real-world setting. Interestingly, we also observe that self-supervised models are good few-shot learners, achieving 77.9% top-1 with access to only 10% of ImageNet. Code: https://github.com/facebookresearch/vissl
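    The contrastive objective behind methods like MoCo and SimCLR mentioned in the abstract can be sketched as an NT-Xent (InfoNCE) loss over two augmented views of each image. This is a generic NumPy illustration, not the SwAV/SEER training code; the function and variable names are mine.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent / InfoNCE loss between two batches of view embeddings.

    z1, z2: (N, D) arrays, one row per image, two augmented views each.
    Each row's positive is its other view; all other rows are negatives.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature                        # (2N, 2N) logits
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = z1.shape[0]
    # The positive for row i is row i+n (and vice versa).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()
```

    Minimizing this pulls the two views of the same image together and pushes all other images apart, which is the shared core of the contrastive family the abstract names.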

  • Beyond English-centric multilingual machine translation

    The Journal of Machine Learning Research

    Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-centric, training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true many-to-many multilingual translation model that can translate directly between any pair of 100 languages. We build and open-source a training data set that covers thousands of language directions with parallel data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models. Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems from the Workshop on Machine Translation (WMT). We open-source our scripts so that others may reproduce the data, evaluation, and final m2m100 model: https://github.com/pytorch/fairseq/tree/master/examples/m2m_100.
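    A "true many-to-many" setup means training examples for every ordered language pair, typically by tagging each example with its target language so one shared model covers all directions. The sketch below is illustrative only; the tag format is my assumption, not necessarily the exact fairseq convention.

```python
from itertools import permutations

LANGS = ["en", "fr", "de", "zh", "sw"]  # toy subset; the paper uses 100

def encode_example(src_text: str, tgt_lang: str) -> str:
    # Hypothetical target-language tag prepended to the source sentence.
    return f"__{tgt_lang}__ {src_text}"

# Every ordered pair of distinct languages is a translation direction.
directions = list(permutations(LANGS, 2))
print(len(directions))  # n * (n - 1) = 20 here
# With 100 languages: 100 * 99 = 9900 directions, versus only
# 2 * 99 = 198 for an English-centric model (to/from English only).
```

    Counting directions this way shows why mining non-English parallel data matters: the English-centric setting covers under 2% of the 9900 possible directions.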

  • End-to-end ASR: from supervised to semi-supervised learning with modern architectures

    We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways of evaluating the characteristics of unlabeled audio which improve acoustic modeling, and show that acoustic models trained with more audio rely less on external language models.
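    Stripped to its core, the pseudo-labeling recipe studied here is: train on labeled data, label the unlabeled pool with that model, then retrain on the union. The toy below substitutes a nearest-centroid classifier on 1-D features for the acoustic model; all names and data are mine, purely to show the loop's shape.

```python
import numpy as np

def fit_centroids(X, y):
    # Stand-in "model": one centroid per class.
    return {c: X[y == c].mean() for c in np.unique(y)}

def predict(model, X):
    classes = sorted(model)
    dist = np.abs(X[:, None] - np.array([model[c] for c in classes]))
    return np.array(classes)[dist.argmin(axis=1)]

# Small labeled set (the LibriSpeech stand-in) ...
X_lab = np.array([0.0, 0.2, 1.0, 1.2])
y_lab = np.array([0, 0, 1, 1])
# ... and a larger unlabeled pool (the LibriVox stand-in).
rng = np.random.default_rng(0)
X_unlab = rng.normal(loc=np.repeat([0.1, 1.1], 50), scale=0.1)

model = fit_centroids(X_lab, y_lab)      # 1) supervised training
pseudo = predict(model, X_unlab)         # 2) pseudo-label the pool
model = fit_centroids(                   # 3) retrain on the union
    np.concatenate([X_lab, X_unlab]),
    np.concatenate([y_lab, pseudo]),
)
```

    In the paper the same loop runs with full acoustic models and a beam-search decoder producing the pseudo-transcripts, but the train/label/retrain structure is the same.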

  • Libri-Light: a benchmark for ASR with limited or no supervision

    ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.
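    Setting (3) is scored with word error rate (WER): word-level Levenshtein distance divided by reference length. A minimal reference implementation (my own helper, not the benchmark's scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    r, h = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words
```

    PER and CER in settings (1)–(2) follow the same edit-distance idea at the phone and character level.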

  • Wav2letter++: A fast open-source speech recognition system

    ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    This paper introduces wav2letter++, a fast open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. We explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2× faster than other optimized frameworks for training end-to-end neural networks for speech recognition. We also show that wav2letter++ training times scale linearly to 64 GPUs, the most we tested, for models with 100 million parameters. High-performance frameworks enable fast iteration, which is often a crucial factor in successful research and model tuning on new datasets and tasks.

  • Letter-based speech recognition with gated convnets

    In the recent literature, "end-to-end" speech systems often refer to letter-based acoustic models trained in a sequence-to-sequence manner, either via a recurrent model or via a structured output learning approach (such as CTC). In contrast to traditional phone (or senone)-based approaches, these "end-to-end'' approaches alleviate the need of word pronunciation modeling, and do not require a "forced alignment" step at training time. Phone-based approaches remain however state of the art on classical benchmarks. In this paper, we propose a letter-based speech recognition system, leveraging a ConvNet acoustic model. Key ingredients of the ConvNet are Gated Linear Units and high dropout. The ConvNet is trained to map audio sequences to their corresponding letter transcriptions, either via a classical CTC approach, or via a recent variant called ASG. Coupled with a simple decoder at inference time, our system matches the best existing letter-based systems on WSJ (in word error rate), and shows near state of the art performance on LibriSpeech.
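    The Gated Linear Unit named as a key ingredient splits the channel dimension in half and gates one half with the sigmoid of the other. A minimal NumPy sketch of the activation itself:

```python
import numpy as np

def glu(x, axis=-1):
    """Gated Linear Unit (Dauphin et al.): a * sigmoid(b),
    where a and b are the two halves of x along `axis`."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

x = np.array([[2.0, -1.0, 0.0, 0.0]])  # halves: a = [2, -1], b = [0, 0]
print(glu(x))  # [[ 1.  -0.5]] since sigmoid(0) = 0.5
```

    The sigmoid gate lets the network modulate which activations pass forward; in the paper this gating, combined with high dropout, is what makes the letter-based ConvNet competitive.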

  • Expressive Languages for Selecting Groups from Graph-Structured Data

    ACM

    Many query languages for graph-structured data are based on regular path expressions, which describe relations among pairs of nodes. We propose an extension that allows retrieving groups of nodes based on group structural characteristics and relations to other nodes or groups. It allows expressing group selection queries in a concise and natural style, and can be integrated into any query language based on regular path queries. We present an efficient algorithm for evaluating group queries in polynomial time from an input data graph. Evaluations using real-world social networks demonstrate the practical feasibility of our approach.
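    As a loose illustration of group selection (a structural predicate over a candidate set), one can filter a candidate group by each member's connectivity inside the group. This toy sketch is my own; it shows the flavor of such queries, not the paper's actual syntax or evaluation algorithm.

```python
# Toy social network as adjacency sets.
G = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob"},
}

def select_group(graph, anchor, min_internal_degree=1):
    """Select the group of anchor's neighbors whose members each have at
    least `min_internal_degree` edges to other members of the group."""
    candidates = set(graph[anchor])
    return {v for v in candidates
            if len(graph[v] & candidates) >= min_internal_degree}

print(select_group(G, "ann"))  # {'bob', 'cat'}: each linked to the other
```

    Real group queries in the paper combine such structural conditions with regular path expressions relating the group to other nodes, evaluated in polynomial time.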

  • Statelets: Coordination of Social Collaboration Processes

    Springer Verlag

    Coordination language


Languages

  • English

    Native or bilingual proficiency

  • German

    Limited working proficiency

  • Russian

    Native or bilingual proficiency

  • Ukrainian

    Native or bilingual proficiency
