PiCoGen2: Piano cover generation with transfer learning approach and weakly aligned data

Chih-Pin Tan, Hsin Ai, Yi-Hsin Chang, Shuen-Huei Guan, Yi-Hsuan Yang
National Taiwan University, KKCompany
ISMIR 2024

Abstract

Piano cover generation aims to create a piano cover from a pop song. Existing approaches mainly employ supervised learning, whose training demands strongly-aligned and paired song-to-piano data built by remapping piano notes to the song audio. This remapping, however, loses piano information and accordingly causes inconsistencies between the original and remapped piano versions. To overcome this limitation, we propose a transfer learning approach that pre-trains our model on piano-only data and fine-tunes it on weakly-aligned paired data constructed without note remapping. During pre-training, to guide the model to learn piano composition concepts instead of merely transcribing audio, we use an existing lead sheet transcription model as the encoder to extract high-level features from the piano recordings. The pre-trained model is then fine-tuned on the paired song-piano data to transfer the learned composition knowledge to the pop song domain. Our evaluation shows that this training strategy enables our model, named PiCoGen2, to attain high-quality results, outperforming baselines on both objective and subjective metrics across five pop genres.
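To make the training strategy concrete, below is a minimal PyTorch sketch of the two-stage recipe, not the authors' implementation: the toy decoder, the token vocabulary, and the stand-in encode() function (which plays the role of the frozen SheetSage encoder) are all hypothetical placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, DIM = 512, 256  # assumed piano-token vocabulary size / feature dim

    class CoverDecoder(nn.Module):
        """Toy autoregressive decoder conditioned on encoder features."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, DIM)
            layer = nn.TransformerDecoderLayer(DIM, nhead=4, batch_first=True)
            self.dec = nn.TransformerDecoder(layer, num_layers=2)
            self.head = nn.Linear(DIM, VOCAB)

        def forward(self, tokens, cond):
            mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
            return self.head(self.dec(self.embed(tokens), cond, tgt_mask=mask))

    def encode(audio):
        """Stand-in for the frozen SheetSage encoder: per-frame features."""
        return torch.randn(audio.size(0), 32, DIM)  # (batch, frames, dim)

    def train_stage(model, batches, lr):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for audio, tokens in batches:
            with torch.no_grad():                 # the encoder is never updated
                cond = encode(audio)
            logits = model(tokens[:, :-1], cond)  # teacher forcing
            loss = F.cross_entropy(logits.reshape(-1, VOCAB),
                                   tokens[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()

    model = CoverDecoder()
    batch = [(torch.randn(2, 16000), torch.randint(0, VOCAB, (2, 64)))]
    train_stage(model, batch, lr=1e-4)  # stage 1: pre-train on piano-only data
    train_stage(model, batch, lr=1e-5)  # stage 2: fine-tune on weakly aligned pairs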

PiCoGen

Strong alignment vs. weak alignment: Unlike Pop2Piano, PiCoGen2 doesn't rely on aligning the notes of a piano performance with the original song. Instead, it discovers beat correspondences to establish the alignment between the original music and its cover. This figure illustrates (a) Pop2Piano constructing strongly-aligned data by remapping piano notes to the song audio, which alters the piano content, and (b) our approach using weakly-aligned song-to-piano pairs without note remapping, which leaves the piano content intact.
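For illustration only (this is not necessarily the paper's exact alignment procedure), the sketch below finds a beat-to-beat correspondence between a song and its piano cover by running DTW over beat-synchronous chroma with librosa; the piano notes themselves are never remapped. The file names are placeholders.

    import librosa
    import numpy as np

    def beat_chroma(path):
        """Beat times plus beat-synchronous chroma for one recording."""
        y, sr = librosa.load(path)
        _, beats = librosa.beat.beat_track(y=y, sr=sr)
        chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
        sync = librosa.util.sync(chroma, beats, aggregate=np.mean)  # ~1 vector per beat
        return librosa.frames_to_time(beats, sr=sr), sync

    song_beats, song_feat = beat_chroma("song.mp3")          # placeholder paths
    cover_beats, cover_feat = beat_chroma("piano_cover.mp3")

    # DTW over the beat-level features yields a weak, beat-to-beat correspondence.
    _, wp = librosa.sequence.dtw(X=song_feat, Y=cover_feat)
    pairs = [(song_beats[i], cover_beats[j])
             for i, j in wp[::-1]
             if i < len(song_beats) and j < len(cover_beats)]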

PiCoGen2

PiCoGen2 system diagram: PiCoGen2 is an end-to-end system that generates a piano cover directly from the input audio. While it still utilizes SheetSage as a feature extractor, this version doesn't sample outputs from latent embeddings. Instead, it takes SheetSage's final hidden state as the intermediate representation, which is passed to the decoder as a condition. In this figure, the fire and snowflake symbols indicate the trainable and frozen parts, respectively; for example, SheetSage's parameters are always frozen.
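In code, the snowflake/fire distinction corresponds to the standard PyTorch freezing pattern sketched below; sheetsage, decoder, audio, and decoder.generate are hypothetical handles for the respective modules, not the released code's API.

    import torch

    def freeze(module):
        """Mark a module as frozen (the snowflake in the diagram)."""
        module.eval()                    # fix dropout / normalization statistics
        for p in module.parameters():
            p.requires_grad_(False)      # gradients never flow into it

    freeze(sheetsage)                    # hypothetical handle; always frozen

    # Only the decoder's parameters (the fire) reach the optimizer.
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)

    with torch.no_grad():
        cond = sheetsage(audio)          # final hidden state as the condition
    piano_tokens = decoder.generate(cond)  # placeholder decoding routine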

Citation

@inproceedings{tan2024picogen2,
    author = {Tan, Chih-Pin and Ai, Hsin and Chang, Yi-Hsin and Guan, Shuen-Huei and Yang, Yi-Hsuan},
    title = {PiCoGen2: Piano cover generation with transfer learning approach and weakly aligned data},
    year = 2024,
    month = nov,
    booktitle = {Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR)},
    address = {San Francisco, CA, United States},
}