Abstract
In recent years, transformers, initially developed for language, have been successfully applied to visual tasks. Vision transformers have been shown to push the state of the art in a wide range of tasks, including image classification, object detection, and semantic segmentation. While ample research has shown promising results in art attribution and art authentication tasks using convolutional neural networks, this paper examines whether the superiority of vision transformers extends to art authentication, thereby improving the reliability of computer-based authentication of artworks. Using a carefully compiled dataset of authentic paintings by Vincent van Gogh and two contrast datasets, we compare the art authentication performances of Swin transformers with those of EfficientNet. Using a standard contrast set containing imitations and proxies (works by painters with styles closely related to van Gogh), we find that EfficientNet achieves the best performance overall. With a contrast set that only consists of imitations, we find the Swin transformer to be superior to EfficientNet by achieving an authentication accuracy of over 85%. These results lead us to conclude that vision transformers represent a strong and promising contender in art authentication, particularly in enhancing the computer-based ability to detect artistic imitations.
1 Art attribution and authentication
Art attribution and art authentication are two significant tasks in the cultural-heritage domain. The former involves identifying the creator of an artwork, while the latter aims to verify whether the artwork was indeed crafted by the presumed artist. These tasks are important because they directly impact the economic and cultural value of artworks [1]. Art attribution entails analyzing various aspects of an artwork, such as its style, materials, and subject, in order to determine the artist responsible for its creation. This can be a challenging task, especially for older works of art where information may be scarce or multiple artists might have been involved. Art authentication involves comprehensive scientific analysis, including the examination of pigments, canvas, paint application techniques, and the historical context [2].
1.1 Computer-based art attribution and authentication
With the rise of digital technology, computer-based visual analysis of artworks has provided a new tool to support art attribution and authentication. Computer-based methods for art attribution and authentication date back to the turn of the millennium, with the first ‘visual stylometry’ efforts [3]. These works are characterized by the development of ad hoc feature extraction methods (including fractal analysis, wavelet coefficients, and edge detection) to represent visual artistic features such as brushstrokes, followed by a machine learning model trained on such features to distinguish the works of the artist from possibly similar works by other artists [4,5,6,7,8]. The excellent pattern-recognition abilities of convolutional neural networks (CNNs) have led to a new wave of studies showing impressive performances on art-classification tasks [9,10,11] and many other visual tasks [12,13,14]. These studies involve complex CNN architectures that are trained on large digitized art collections, generally adding to the CNN a last dense (fully connected) layer. The last layer feeds into a single-output neuron in case of art authentication or into N output neurons for art attribution to one of N artists [15].
It should be acknowledged that computer-based art attribution and authentication are not without their limitations and challenges. The first group of limitations stems from the digital nature of the images used in this technique. These images might have deformations and loss of information because of factors such as image resolution, lighting conditions, camera type, and post-processing compression rate. The second group of limitations pertains to connoisseurship. Previous works [16, 17] have discussed the role of the machine as a new type of art expert responsible for attributions and authentications. Bell and Offert [16] have highlighted important similarities between human and machine connoisseur approaches, such as knowledge of numerous works by the same artist and related works. However, there are noteworthy differences that constitute limitations of the computer-based techniques. While the computer relies solely on optical information (images), the human connoisseur also considers contextual information, including but not limited to historical knowledge, provenance, and scientific results.
While most of the early studies have primarily focused on traditional machine learning for art-attribution tasks [5, 6, 18, 19], our paper delves into the more specific task of art authentication, using Vincent van Gogh as a case study. Our paper aims to perform a comparative evaluation of vision transformers (ViTs) [20,21,22] and CNNs on the art-authentication task and determine the level of performance that can be attained on this challenging task.
1.2 Selection of architectures
As we are interested in a comparison between previous state-of-the-art CNN-based methods and ViTs, we have to select representatives of both types of methods. To select a CNN architecture, we determine the best-performing architecture on art-classification tasks by relying on a sample of studies performed over the last 10 years. Although the selected studies have been performed with different methods and datasets, and mostly focused on art attribution instead of art authentication, their performances provide a clear sign of the best-performing architecture. Table 1 lists the performances and performance measures for five representative studies over the last 10 years. The performances fall within a limited range, \(78{-}91\%\). The best-performing study of Table 1 [11] made use of the ResNet101 architecture [23].
Hence, we select ResNet101 as one of the CNN architectures for our experiments. As will be motivated in Sect. 3.3, we include another CNN architecture called EfficientNet [24] in our selection. For the ViTs we will rely on two variants of a state-of-the-art architecture called Swin transformer [21].
The outline of the rest of the paper is as follows. Section 2 reviews CNNs and ViTs, highlighting the architectures used in this study, ResNet101, EfficientNet, and the Swin transformer. Section 3 details the experimental procedure, and Sect. 4 presents the results. Section 5 ends the paper with a conclusion and discussion of future work.
2 Convolutional neural networks and vision transformers
Convolutional neural networks gained considerable popularity with the 2012 release of AlexNet, which largely outperformed all previous models at the ILSVRC ImageNet Challenge 2012 [12, 25, 26]. This popularity was further solidified by a continuous stream of improved architectures and layers, most notably InceptionV3, VGG, and ResNet, and current state-of-the-art models such as EfficientNet [23, 24, 27, 28].
ResNet and EfficientNet are two of the most successful CNNs. ResNet, as described by He et al. [23], is a (potentially) extremely deep CNN. In contrast to a standard CNN, where each stage learns a mapping \(H(x)\) of its input x, ResNet stages learn the residual function \(F(x) = H(x)-x\); a skip connection adds the input back, so that the stage outputs \(F(x)+x\). Skip connections make very deep networks trainable, which allows ResNets to benefit from their increased depth. EfficientNet represents a class of CNN models introduced by Tan and Le [24]. These models are optimized by scaling the width, depth, and input resolution of CNNs with a fixed ratio. EfficientNets have demonstrated superior performance compared to ResNets on image classification tasks. In our experiments, we use the variants ResNet101 and EfficientNetB5.
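The role of the skip connection can be illustrated with a toy sketch. It is not the ResNet implementation: `residual_stage` and `residual_fn` are hypothetical names, and plain Python lists stand in for feature maps. The only point shown is that the stage output is the learned residual plus the unchanged input.

```python
def residual_stage(x, residual_fn):
    """Apply one residual stage: the output is F(x) + x.

    x           -- list of floats (a toy feature vector)
    residual_fn -- hypothetical callable standing in for the learned F
    """
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same size as x for the skip connection")
    # The skip connection adds the unchanged input to the residual.
    return [f + xi for f, xi in zip(fx, x)]


# If the learned residual is zero, the stage reduces to the identity mapping.
identity_out = residual_stage([1.0, -2.0, 3.0], lambda v: [0.0] * len(v))
# A non-zero residual perturbs the input rather than replacing it.
shifted_out = residual_stage([1.0, -2.0, 3.0], lambda v: [0.5 * a for a in v])
```

Because a zero residual already yields the identity, additional stages can be stacked without degrading the signal they pass on, which is the usual argument for why residual networks can be made very deep.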
Vision transformers are relatively new deep learning architectures that have gained considerable attention and popularity in the computer vision community [20]. They represent a departure from traditional CNNs by replacing the typical convolutional layers with attention mechanisms [29]. In linguistic tasks, the introduction of an attention mechanism facilitated the encoding of long-range contextual information, which led to exceptional results on a wide range of tasks [30]. Recent breakthrough performances of GPT4 [31] and related large language models are due to the power of transformers. One of the main advantages of ViTs is their ability to capture relatively long-range dependencies within an image, which is essential for a wide range of computer vision tasks. This is achieved through the attention mechanism, which allows the model to attend to any region of the image when making predictions, rather than being limited to a fixed image context, like CNNs. ViTs have achieved state-of-the-art results on several image classification benchmarks, including ImageNet, and have shown promising results on other tasks, such as object detection and semantic segmentation.
The Swin transformer was recently proposed as a generic transformer-based backbone for computer vision [21, 22]. The basic architecture is hierarchical and employs an efficient self-attention mechanism using shifting windows. Its hierarchical architecture allows for capturing multi-scale relations, and its shifting windows mitigate the growth of computational complexity with image size. Figure 1 illustrates a four-stage Swin transformer, the so-called Swin-Tiny variant. The input comprises an image of size \(H \times W \times 3\), which is partitioned into \(H/4 \times W/4\) non-overlapping patches of \(4 \times 4 \times 3\) values each (the rectangle labeled "Patch Partition"). Each patch is embedded into a C-dimensional "token" by means of a linear layer ("Linear Embedding"), which yields a token map of size \(H/4 \times W/4 \times C\), where C is a dimensionality parameter of the Swin architecture. The token map is fed into the building block of the Swin transformer ("SWIN Transformer Pair"), the inner structure of which is illustrated in Fig. 2. The first block consists of layer normalization, multi-head attention, layer normalization, and a two-layer multilayer perceptron. The multi-head (self-)attention is applied within non-overlapping \(M \times M\) windows of the token map (\(M = 7\)). The curved arrows represent skip connections. The second block is identical to the first, but applies attention to shifted \(M \times M\) windows.
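As a rough illustration of the windowing, the following sketch assigns each token position to an \(M \times M\) window, with and without a shift. It makes several assumptions that are not stated in the text above: the function name `window_index` is hypothetical, the shift of \(M // 2\) positions and its realization as a cyclic displacement follow the Swin paper [21], and the attention mask that Swin applies to wrapped-around windows is omitted.

```python
def window_index(row, col, grid_h, grid_w, M=7, shift=0):
    """Return the index of the M x M window that contains token (row, col).

    With shift > 0 the token grid is cyclically displaced before it is cut
    into windows, which emulates the shifted-window configuration.
    """
    if grid_h % M or grid_w % M:
        raise ValueError("grid size must be a multiple of the window size")
    r = (row - shift) % grid_h  # cyclic displacement along the rows
    c = (col - shift) % grid_w  # cyclic displacement along the columns
    windows_per_row = grid_w // M
    return (r // M) * windows_per_row + (c // M)


# A 224 x 224 input gives a 56 x 56 token map, i.e. 8 x 8 = 64 windows for M = 7.
num_windows = len({window_index(r, c, 56, 56) for r in range(56) for c in range(56)})

# Tokens (2, 2) and (3, 3) share a window in the regular block ...
same_regular = window_index(2, 2, 56, 56) == window_index(3, 3, 56, 56)
# ... but fall into different windows once the grid is shifted by M // 2 = 3.
same_shifted = window_index(2, 2, 56, 56, shift=3) == window_index(3, 3, 56, 56, shift=3)
```

Alternating the two configurations lets information flow between neighboring windows while each attention computation still covers only \(M \times M\) tokens.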
To create a hierarchical “pyramid-like” representation, in the second stage of Fig. 1 the “patch merging” concatenates all values of each non-overlapping \(2 \times 2 \times C\) region into \(1 \times 1 \times 4C\) values and uses a linear layer to map these onto 2C values. As a result, the \(H/4 \times W/4 \times C\) output of the first stage is transformed into a patch-merged output of dimensions \(H/8 \times W/8 \times 2C\). The third stage performs the same steps as the second stage, but applies three Swin transformer pairs instead of one. Finally, in the fourth stage, patch merging is combined with a single Swin transformer pair. In our experiments, the output of the fourth Swin transformer pair is average pooled and submitted to a binary classifier.
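The shape bookkeeping of the four stages can be checked with a short sketch. The function names are hypothetical, nested Python lists stand in for tensors, and the linear layer that maps 4C values onto 2C values is only reflected in the shape arithmetic, not implemented.

```python
def swin_stage_shapes(height, width, C=96, patch=4, num_stages=4):
    """Return the (rows, cols, channels) shape of the token map after each stage."""
    h, w, c = height // patch, width // patch, C
    shapes = [(h, w, c)]
    for _ in range(num_stages - 1):
        # Patch merging halves both spatial dimensions and doubles the channels.
        h, w, c = h // 2, w // 2, 2 * c
        shapes.append((h, w, c))
    return shapes


def merge_2x2(grid):
    """Concatenate each non-overlapping 2 x 2 group of C-dim tokens into one 4C-dim token."""
    rows, cols = len(grid), len(grid[0])
    merged = []
    for r in range(0, rows - rows % 2, 2):
        merged_row = []
        for c in range(0, cols - cols % 2, 2):
            merged_row.append(grid[r][c] + grid[r + 1][c] + grid[r][c + 1] + grid[r + 1][c + 1])
        merged.append(merged_row)
    return merged


shapes_tiny = swin_stage_shapes(224, 224, C=96)
toy = merge_2x2([[[1], [2]], [[3], [4]]])  # a 2 x 2 grid of 1-dim tokens
```

For a \(224 \times 224\) input and \(C = 96\), the sketch yields token maps of \(56 \times 56 \times 96\), \(28 \times 28 \times 192\), \(14 \times 14 \times 384\), and \(7 \times 7 \times 768\).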
Apart from the Swin-Tiny variant, three larger variants, which differ from Swin Tiny in the value of the dimensionality parameter C and in the number of Swin Pairs in the third stage N, have been proposed. In our experiments, we use the Swin-Tiny (\(C = 96\), \(N = 3\)) and Swin-base (\(C = 128\), \(N = 9\)) variants.
3 Experiments
This section specifies the experiments by discussing the van Gogh dataset (Sect. 3.1), the data preparation and augmentation (Sect. 3.2), the specific CNN and Swin architectures and their hyperparameter settings (Sect. 3.3), and our evaluation procedure (Sect. 3.4).
3.1 Van Gogh dataset
Our dataset for the authentication task was carefully collected and consists of 654 images of authentic paintings (authentic set) and 669 or 137 images of non-authentic ones (depending on the type of contrast set). The resolutions of the images of artworks vary from one reproduction to another. In what follows, we outline the authentic set and two versions of the contrast set: the “standard contrast set” and the “refined contrast set”. As will be described in Sect. 4, the development of the refined contrast set is motivated by the results on the standard contrast set, which reveal that art authentication requires a more constrained selection of artworks in the contrast set. The composition of each contrast set is described below.
3.1.1 Authentic set
When compiling our authentic set, we have used the standard ‘La Faille’ Catalogue Raisonné [32] as a reference, meaning that all authentic images used for training are recorded there. Moreover, we have removed from the authentic set the images whose authenticity is questioned by contemporary experts. This approach enables us to mitigate the risk of accidentally introducing fake artworks into the original dataset (label noise). The careful crafting of the authentic set distinguishes this work from previous ones, which are usually trained on images downloaded from WikiArt [33] (a less reliable source than the established Catalogue Raisonné).
3.1.2 Contrast set
As art authentication involves a binary classification task, we carefully compile a second set that serves as a contrast to the authentic works. This secondary set consists of negative examples, i.e., artworks that are not attributed to van Gogh.
3.1.3 Standard contrast set
The standard contrast set features 69 imitations: 10 copies by followers of van Gogh such as Vik Muniz, Blanche Derousse, and Jamini Roy; 40 imitations in van Gogh’s style; and 21 known forgeries, including 8 produced by the famous forger Wacker [34, 35]. In addition, to achieve a balance with the authentic set, the standard contrast set also incorporates 600 proxies which are paintings by contemporary artists who utilized techniques and styles similar to those of van Gogh—mainly Post-Impressionism, Cloisonnism, and Japonism. The main proxy artists are Paul Cézanne (114 images), Henri de Toulouse-Lautrec (48 images), Maurice Prendergast (47 images), and Henri Matisse (47 images).
3.1.4 Refined contrast set
Including proxies in the standard contrast set introduces painting styles that differ greatly from those of van Gogh. Hence, for the construction of our refined contrast set, we remove all proxies and gather additional imitations from auction archives. We include 68 additional images that were cataloged as being inspired by van Gogh: 50 images are described as After Vincent van Gogh, 14 are in Manner of Vincent van Gogh, 2 are Attributed to Vincent van Gogh, 1 is Circle Vincent van Gogh, and 1 is Follower Vincent van Gogh. Table 2 shows the composition of the refined contrast set relative to the standard contrast set.
3.2 Data preparation
The dataset consists of sub-images of paintings, i.e., RGB images normalized to a fixed size of \(256 \times 256\) pixels, and the channel values normalized to the unit interval. The sub-images are created by dividing the whole image into \(2^p \times 2^p\) equally sized units, with p depending on the resolution of the original image as follows: \(p=2\), if the smaller side of an image is larger than 1024 pixels, and \(p=1\), if the smaller side is larger than 512 pixels and smaller than 1024. For all images, regardless of the resolution, we also include the sub-image of the center-cropped square stemming from the full image. Figure 3 exemplifies the generation of 16 squared, center-cropped patches from an authentic van Gogh painting. This patching method allows the models to extract very fine-grained brushstroke-level information from the smaller patches, but also more compositional and representational features from the full patch and the larger patches. Some of the examined architectures require an input size of \(224 \times 224\) pixels. In that case the original \(256 \times 256\) sub-images were downsampled using bicubic resampling.
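The patching rule can be sketched as a function that returns crop boxes. Several details are assumptions, since the text leaves them open: images whose smaller side is at most 512 pixels get \(p = 0\) (only the center crop), a smaller side of exactly 1024 pixels is assigned \(p = 1\), grid cells are returned as they are without an additional square crop, and the final resize to \(256 \times 256\) pixels is left out. The names `grid_exponent` and `patch_boxes` are hypothetical.

```python
def grid_exponent(width, height):
    """Return p such that the image is divided into 2**p x 2**p units."""
    shorter = min(width, height)
    if shorter > 1024:
        return 2
    if shorter > 512:
        return 1
    return 0  # assumption: small images only contribute the center crop


def patch_boxes(width, height):
    """Return crop boxes as (left, top, right, bottom) tuples in pixel coordinates."""
    # Center-cropped square stemming from the full image (always included).
    side = min(width, height)
    left, top = (width - side) // 2, (height - side) // 2
    boxes = [(left, top, left + side, top + side)]
    # Equally sized grid units, row by row.
    n = 2 ** grid_exponent(width, height)
    if n > 1:
        for i in range(n):
            for j in range(n):
                boxes.append((j * width // n, i * height // n,
                              (j + 1) * width // n, (i + 1) * height // n))
    return boxes


large = patch_boxes(2048, 1600)   # p = 2: 16 grid units plus the center crop
medium = patch_boxes(800, 600)    # p = 1: 4 grid units plus the center crop
small = patch_boxes(400, 300)     # p = 0: center crop only
```

Each box would then be cropped from the reproduction and resized to the fixed input size before being fed to a model.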
To emphasize the importance of imitations over proxies in the standard contrast set, we assign sample weights \(w_{im}\) to the imitations. In preliminary experiments, we found that \(w_{im} = 10\), meaning that imitations weigh ten times more than proxies, yields the best results. This value remains the same across all experiments with the standard contrast set described in this study. In the refined contrast set, we did not employ sample weighting, setting \(w_{im}=1\).
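A minimal sketch of how such sample weights enter a binary cross-entropy loss is given below. It rests on assumptions: authentic works and proxies receive weight 1, the label 1 denotes the authentic class, and the weighted loss is averaged over the number of samples, which is only one of several possible conventions. The function names and the string labels are hypothetical.

```python
import math


def sample_weights(kinds, w_im=10.0):
    """Map each sample kind ('authentic', 'imitation', 'proxy') to its weight."""
    return [w_im if kind == "imitation" else 1.0 for kind in kinds]


def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    """Sample-weighted binary cross-entropy, averaged over the samples."""
    total = 0.0
    for y, p, w in zip(y_true, y_pred, weights):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


kinds = ["authentic", "imitation", "proxy"]
weights = sample_weights(kinds)
# The same mistake costs ten times more on an imitation than on a proxy.
loss_proxy = weighted_bce([0], [0.5], sample_weights(["proxy"]))
loss_imitation = weighted_bce([0], [0.5], sample_weights(["imitation"]))
```

Setting `w_im=1.0` recovers the unweighted loss used with the refined contrast set.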
We evaluate each model in \(N=20\) experiments. In each experiment, we randomly assign the paintings, including their constituent patches, to the training, validation, and test partitions. These random assignments result in N training, validation, and test partitions. Each model is trained and evaluated on exactly the same N partitions. This ensures that each architecture is trained and evaluated in the same manner, which enables a fair comparative evaluation. Table 3 lists the compositions of the partitions in terms of the number of images for the authentic and contrast sets. In each experiment, a randomly selected subset of authentic images of approximately the same size as the set of contrast images is used for training.
Because we subdivide each image into patches, the actual number of patches in each partition is much larger. For instance, for the experiments with the standard contrast set, the actual numbers vary slightly (because images differ in their number of patches): about fifteen thousand patches in the training partition and two thousand patches each in the validation and test partitions. We emphasize that all patches of each painting are always assigned to the same partition. As a consequence, the test set always consists of patches that were not part of the training or validation partitions.
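The painting-level assignment can be sketched as follows. The split fractions are placeholders, not the proportions of Table 3, and the function names are hypothetical. The point of the sketch is that the random assignment is made per painting and that patches merely inherit the partition of their painting.

```python
import random


def split_paintings(painting_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly assign whole paintings to train/val/test partitions."""
    ids = sorted(set(painting_ids))
    random.Random(seed).shuffle(ids)  # seeded so every model sees the same split
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }


def assign_patches(patches, split):
    """Place each (painting_id, patch_index) pair in the partition of its painting."""
    partition_of = {pid: name for name, members in split.items() for pid in members}
    assigned = {name: [] for name in split}
    for pid, patch_index in patches:
        assigned[partition_of[pid]].append((pid, patch_index))
    return assigned


ids = [f"painting_{i}" for i in range(10)]
split = split_paintings(ids, seed=1)
patches = [(pid, k) for pid in ids for k in range(5)]
parts = assign_patches(patches, split)
```

Reusing the same seeds for every architecture reproduces the setting in which all models are trained and evaluated on identical partitions.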
3.3 Architectures and training procedure
The recent outstanding art-classification results reported by Dobbs and Ras [11], as discussed in Sect. 1.2, have led us to choose ResNet101, the 101-layer version of ResNet [23], as a representative CNN for our van Gogh authentication task. Although ResNet101 represents the state of the art in art classification, it may not be the most robust CNN available. Therefore, to provide a more comprehensive evaluation, we include another CNN in our analysis that better represents the class of modern CNNs: EfficientNet [24]. Specifically, we select EfficientNetB5, as its complexity (measured by the number of parameters) roughly matches that of the simplest Swin transformer. For our experiments, we utilize two variations of Swin transformers (Swin Tiny and Swin Base), with the detailed description of the Swin transformer architecture provided in Sect. 2.
Using the standard contrast set, the four architectures examined are: EfficientNetB5, ResNet101, Swin-Tiny, and a larger version called Swin-Base. The latter is included to determine the potential beneficial effect of this larger Swin transformer variant. EfficientNetB5 has 28 M parameters, ResNet101 has 44.7M parameters, and Swin-Tiny and Swin-Base have 28 M and 88 M parameters, respectively. All architectures are pretrained on ImageNet [25]. ResNet101, EfficientNetB5, and Swin-Tiny are pre-trained on the 1K version of ImageNet, whereas Swin-Base is pre-trained on the 22K version of ImageNet.
In preliminary experiments we explored three variants of transfer learning: (i) freezing the base architecture and training a new top layer (the standard method of transfer learning), (ii) initially freezing the base, training the new top layer, and subsequently training the base and top with a small learning rate, and (iii) unfreezing all layers and training the entire architecture with a small learning rate. It turned out that variant (iii) gave the best results for all architectures, which is in line with previous findings in art classification [15, 36]. Hence, in contrast to what is typical of transfer learning, we employed variant (iii), where the top was defined as a randomly initialized dense layer. For the initialization, we use “He normal” initialization [37], which ensures that the random weight values do not saturate the receiving neurons’ activations. To this end, the values of the \(w_n\) weights feeding into a neuron are drawn from a (truncated) normal distribution with \(\mu = 0\) and \(\sigma = \sqrt{2/w_n}\). For all architectures, training is performed with binary cross-entropy as loss function, the Adam optimizer, batch size 32, learning rate 0.0001, early stopping (patience \(= 20\) epochs and minimum delta \(= 0.001\)), and imitation-sample weights \(w_{im}=10\).
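The initialization of the dense top layer can be sketched with the standard library alone. Here `fan_in` plays the role of \(w_n\), the number of weights feeding into a neuron. Truncating at two standard deviations by resampling is an assumption, as is the example input size of 768; neither is specified in the text, and the function name `he_normal` is only a label for this sketch.

```python
import math
import random


def he_normal(fan_in, fan_out, seed=0):
    """Draw a fan_in x fan_out weight matrix from a truncated normal distribution.

    The standard deviation is sqrt(2 / fan_in); samples beyond two standard
    deviations are redrawn (assumed truncation rule).
    """
    sigma = math.sqrt(2.0 / fan_in)
    rng = random.Random(seed)
    weights = []
    for _ in range(fan_in):
        row = []
        while len(row) < fan_out:
            value = rng.gauss(0.0, sigma)
            if abs(value) <= 2.0 * sigma:
                row.append(value)
        weights.append(row)
    return weights


# Hypothetical top layer: 768 pooled features feeding one output neuron.
top_weights = he_normal(fan_in=768, fan_out=1)
sigma_top = math.sqrt(2.0 / 768)
```

Scaling the spread with \(1/\sqrt{w_n}\) keeps the summed input of a neuron at a comparable magnitude regardless of how many weights feed into it.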
For the experiments with the refined contrast set, we apply the same training procedure but do not use imitation-sample weights and restrict ourselves to EfficientNetB5 and Swin-Tiny. The motivation for focusing on these two architectures is twofold: (i) both architectures perform best in the experiments with the standard contrast set, and (ii) comparing the performances of these architectures is fair due to their almost equal parameter complexity.
3.4 Evaluation procedure
For each architecture, we performed \(N=20\) experiments and report the average prediction accuracies for individual patches and for the entire paintings. The latter is determined for each artwork by taking the mean of the predictions of its constituent patches, including the sub-image with a center-cropped square stemming from the full image. To further understand the model’s performance, we present accuracies per class, distinguishing between the authentic and the contrast classes. Additionally, within the contrast set, we provide separate accuracies for proxies and imitations.
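The painting-level decision can be sketched as an average over patch scores. Two assumptions are made: the patch predictions are scores in the unit interval with higher values indicating "authentic", and a painting is called authentic when the mean score reaches 0.5. The function name is hypothetical.

```python
from collections import defaultdict


def painting_predictions(patch_scores, threshold=0.5):
    """Average patch scores per painting and threshold the mean.

    patch_scores -- iterable of (painting_id, score) pairs, score in [0, 1]
    Returns {painting_id: (mean_score, predicted_authentic)}.
    """
    grouped = defaultdict(list)
    for painting_id, score in patch_scores:
        grouped[painting_id].append(score)
    result = {}
    for painting_id, scores in grouped.items():
        mean_score = sum(scores) / len(scores)
        result[painting_id] = (mean_score, mean_score >= threshold)
    return result


scores = [("a", 1.0), ("a", 0.5), ("b", 0.25), ("b", 0.5)]
per_painting = painting_predictions(scores)
```

Averaging lets confident patches outweigh uncertain ones, so a painting can be classified correctly even when some of its patches are not.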
4 Results
In this section, we present separately the results for the experiments conducted with both the standard contrast set and the refined contrast set.
4.1 Results for the standard contrast set
Table 4 reports the results obtained with the standard contrast set. For each of the examined architectures (with the pretraining variants mentioned in Sect. 3.3), it lists the mean accuracy for the patches and the entire paintings, as well as the number of parameters for each architecture.
From these results, we draw three observations. The main observation is that EfficientNetB5 yields the best art-authentication performance, both on patches and on entire images. We reiterate that in terms of the number of parameters, EfficientNetB5 has roughly the same complexity and initialization as Swin-Tiny (i.e., 28 M and ImageNet 1K, respectively), which makes it a fair comparison. The second observation is that both Swin architectures yield a considerable improvement in performance over ResNet101, i.e., accuracies \(\approx 0.89{-}0.90\), roughly matching the performances listed in Table 1 in Sect. 1.2. The third observation is that although the Swin-Base transformer performs marginally better than the Swin-Tiny transformer on patches, it does not result in a better performance on paintings. At first sight, these results suggest EfficientNetB5 outperforms ResNet101 and the Swin transformers on art-authentication tasks. However, a closer examination of the results for the constituents of the standard contrast set leads to a different view.
Table 5 lists the accuracies for the authentic and standard contrast sets, as well as the two constituent types of contrast artworks: imitations and proxies. The results show that the performances obtained by all architectures mainly reflect a successful separation of authentic paintings and artworks by proxies, given that both have accuracies of more than \(90\%\). On the other hand, the performance on the imitations is considerably lower, despite the use of sample weights. We acknowledge that the task of distinguishing imitations from originals is much more complex and fine-grained than distinguishing proxies from originals. Proxies are artworks created by known artists in their own style, albeit similar to the style of van Gogh, while the imitations (including copies and forgeries) contain only artworks that were created, explicitly or implicitly, in the style of van Gogh, with a clear and close emulation of the artist. Thus, this last category contains artworks with a much higher degree of similarity to the authentic ones.
Clearly, art authentication requires a fine distinction between imitations and authentic art. Hence, the poor performance on the imitations motivated the development of the refined contrast set. The results of our experiments with the refined contrast set are the subject of the next section.
4.2 Results for the refined contrast set
As mentioned in Sect. 3.3, we trained the two comparable architectures on the van Gogh dataset by using a refined contrast set which only comprises imitations. We did not use sample weights for these experiments. The obtained results are presented in Tables 6 and 7.
Table 6 shows the accuracies for paintings in the authentic and refined contrast sets. We observe that in this case, a much better balance is achieved between the performances on the authentic and contrast artworks. This applies especially to Swin-Tiny, which outperforms EfficientNetB5 and achieves the best overall performance. The much-improved performance on the imitations also suggests that the second dataset is an improvement over the first, as it best addresses the core of art authentication: the separation between authentic works and reproductions. In line with this reasoning, the superior performance of Swin-Tiny on this second dataset suggests a non-negligible improvement over state-of-the-art CNNs.
Table 7 lists the mean accuracy, precision, and recall for EfficientNetB5 and Swin-Tiny. The latter scores best on all three metrics, showing that Swin-Tiny exhibits the best painting-based authentication performance.
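For reference, the three metrics can be computed from painting-level decisions as in the sketch below. Treating the authentic class (label 1) as the positive class is an assumption; the text does not state which class precision and recall refer to. The function name is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against empty denominators when a class is never predicted or absent.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy example: three authentic paintings and one imitation.
metrics = binary_metrics([1, 1, 1, 0], [1, 1, 0, 0])
```

In an authentication setting, precision under this convention measures how often a work accepted as authentic really is authentic, while recall measures how many authentic works are accepted.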
Table 8 provides insight into the degree of overlap of patch predictions made by both architectures. The confusion table shows the percentages of patches predicted correctly and incorrectly by both architectures. Both architectures correctly classify the majority of patches (79%), and they disagree on only small percentages (7.4% and 5.7%). Both architectures incorrectly classify a slightly larger percentage (8.2%) of patches. The differences in correctly predicted patches suggest that art authentication may benefit from a hybrid CNN-ViT approach that combines the strengths of both architectures.
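An overlap table of this kind can be derived from the patch-level decisions of two models as sketched below. The function and key names are hypothetical, and the toy labels are not taken from the study.

```python
def agreement_table(y_true, pred_a, pred_b):
    """Return the percentage of samples in each of the four agreement cells."""
    if not y_true or not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("inputs must be non-empty and of equal length")
    counts = {"both_correct": 0, "only_a_correct": 0, "only_b_correct": 0, "both_wrong": 0}
    for truth, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == truth), (b == truth)
        if a_ok and b_ok:
            counts["both_correct"] += 1
        elif a_ok:
            counts["only_a_correct"] += 1
        elif b_ok:
            counts["only_b_correct"] += 1
        else:
            counts["both_wrong"] += 1
    return {cell: 100.0 * n / len(y_true) for cell, n in counts.items()}


table = agreement_table(
    y_true=[1, 1, 0, 0],
    pred_a=[1, 1, 0, 1],
    pred_b=[1, 0, 1, 1],
)
```

The two off-diagonal cells bound what a combination of the models could gain: only samples that at least one model classifies correctly can be recovered by combining them.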
Figure 4 illustrates the differences between both architectures in terms of the distributions of their patch predictions. The histograms for EfficientNetB5 and Swin-Tiny are shown in the left and right columns, respectively. The top row displays the incorrect patch predictions, and the bottom row the correct ones. The top left histogram shows a relatively large number of wrong predictions in the interval 0.5–0.7 for EfficientNetB5, i.e., the first peak to the right of the middle. These indicate false positives, revealing a bias toward classifying patches as authentic. Such a peak is not evident for Swin-Tiny, although there are more "confident" false predictions at 0 and 1 (see the top right histogram). Comparing the bottom two histograms, which show the correct predictions, it is clear that Swin-Tiny (right histogram) has a much larger number of very confident predictions (near 0 and 1) than EfficientNetB5. These illustrations reveal the subtle ways in which both types of architectures (CNN and vision transformer) differ in the realization of their predictions. To what extent these differences are algorithm-specific is unclear and subject to further investigation.
5 Conclusion and future work
We performed a comparative evaluation of CNNs and vision transformers. We found that EfficientNetB5 outperforms the Swin-Tiny and Swin-Base transformers on the standard contrast set, mainly through better classification of proxies rather than of imitations. In our example, this shows that EfficientNetB5 is better able to distinguish between van Gogh and his contemporaries than both Swin transformers. The Swin-Tiny transformer was shown to be marginally superior to EfficientNetB5 on a refined contrast set (containing imitations only) that better reflects the essence of art authentication. For the Swin-Tiny transformer, the change in contrast set was associated with a jump in imitation-classification accuracy from 0.53 on the standard contrast set to 0.84 on the refined contrast set.
While further tests should be carried out to determine the generalizability of these results to other artists’ datasets, we also highlight that the deep learning approach to art authentication is inherently more generalizable than the feature-engineering approaches mentioned in Sect. 1.1, as it requires little hyperparameter tuning and does not rely on an isolated feature (e.g., brushstrokes) that may not be visible in the works of all artists.
Our results lead us to conclude that visual backbones based on vision transformers are at least as viable for art authentication as CNNs and that their predictions largely overlap. In our future work, we will further explore how vision transformers realize their advantage and determine to what extent recently proposed improvements to the Swin transformer, such as the cross-shaped window transformer [38], lead to further improvements on the task of art authentication.
Future work should also address the limitations arising from the digital nature of the training images. We stress the importance of developing methodologies that can achieve invariance to different camera acquisitions, resolutions, and scales. Additionally, an interesting line of research could explore incorporating contextual information into the models, potentially leveraging multi-modality and textual guidance.
Data availability statement
The datasets generated during and/or analyzed during the current study are not publicly available due to licensing constraints, but are available from the corresponding author on reasonable request.
References
Spencer RD (2004) The expert versus the object: judging fakes and false attributions in the visual arts. Oxford University Press, Oxford
Sloggett R (2019) Unmasking art forgery: scientific approaches. In: Hufnagel S, Chappell D (eds) The Palgrave handbook on art crime. Springer, London, pp 381–406
Postma EO, Herik HJvd (2000) Discovering the visual signature of painters. Future directions for intelligent systems and information sciences. The future of speech and image technologies, brain computers, WWW, and Bioinformatics. Springer, Heidelberg, pp 129–147
Johnson CR, Hendriks E, Berezhnoy IJ, Brevdo E, Hughes SM, Daubechies I, Li J, Postma E, Wang JZ (2008) Image processing for artist identification. IEEE Signal Process Mag 25(4):37–48
Qi H, Taeb A, Hughes SM (2013) Visual stylometry using background selection and wavelet-HMT-based Fisher information distances for attribution and dating of impressionist paintings. Signal Process 93(3):541–553
Liu H, Chan RH, Yao Y (2016) Geometric tight frame based stylometry for art authentication of van Gogh paintings. Appl Comput Harmon Anal 41(2):590–602
Li J, Yao L, Hendriks E, Wang JZ (2011) Rhythmic brushstrokes distinguish van Gogh from his contemporaries: findings via automated brushstroke extraction. IEEE Trans Pattern Anal Mach Intell 34(6):1159–1176
Taylor RP, Micolich AP, Jonas D (1999) Fractal analysis of Pollock’s drip paintings. Nature 399(6735):422–422
van Noord N, Hendriks E, Postma E (2015) Toward discovery of the artist’s style: learning to recognize artists by their artworks. IEEE Signal Process Mag 32(4):46–54
van Noord N, Postma E (2017) Learning scale-variant and scale-invariant features for deep image classification. Pattern Recogn 61:583–592
Dobbs T, Ras Z (2022) On art authentication and the Rijksmuseum challenge: a residual neural network approach. Expert Syst Appl 116933
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge, MA
Amelio A, Bonifazi G, Corradini E, Di Saverio S, Marchetti M, Ursino D, Virgili L (2022) Defining a deep neural network ensemble for identifying fabric colors. Appl Soft Comput 130:109687
Corradini E, Porcino G, Scopelliti A, Ursino D, Virgili L (2022) Fine-tuning SalGAN and PathGAN for extending saliency map and gaze path prediction from natural images to websites. Expert Syst Appl 191:116282. https://doi.org/10.1016/j.eswa.2021.116282
Cetinic E, Lipic T, Grgic S (2018) Fine-tuning convolutional neural networks for fine art classification. Expert Syst Appl 114:107–118
Bell P, Offert F (2021) Reflections on connoisseurship and computer vision. J Art Historiography (24)
Zhu Y, Ji Y, Zhang Y, Xu L, Zhou AL, Chan E (2019) Machine: the new art connoisseur. arXiv preprint arXiv:1911.10091
Lyu S, Rockmore D, Farid H (2004) A digital technique for art authentication. Proc Natl Acad Sci 101(49):17006–17010
Hughes JM, Graham DJ, Rockmore DN (2010) Quantification of artistic style through sparse coding analysis in the drawings of Pieter Bruegel the Elder. Proc Natl Acad Sci 107(4):1279–1283
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: 9th international conference on learning representations, ICLR 2021, Virtual Event, Austria, May 3–7
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV)
Liu Z, Hu H, Lin Y, Yao Z, Xie Z, Wei Y, Ning J, Cao Y, Zhang Z, Dong L, Wei F, Guo B (2022) Swin transformer v2: scaling up capacity and resolution. In: CVPR 2022
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning, pp 6105–6114. PMLR
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition (CVPR 2009). IEEE. https://ieeexplore.ieee.org/abstract/document/5206848/
Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2015) Rethinking the inception architecture for computer vision
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30, pp 6000–6010
Lin T, Wang Y, Liu X, Qiu X (2022) A survey of transformers. AI Open 3:111–132
OpenAI (2023) GPT-4 technical report
de la Faille JB (1928) L’oeuvre de Vincent van Gogh: Catalogue Raisonné. Van Oest, Paris
David LO, Pedrini H, Dias Z, Rocha A (2021) Authentication of Vincent van Gogh’s work. In: International conference on computer analysis of images and patterns, pp 371–380. Springer
Nelson MR (2011) Underneath the van Gogh F614. Chemmatters, 15
Feilchenfeldt W (1989) Van Gogh fakes: the Wacker affair, with an illustrated catalogue of the forgeries. Simiolus: Netherlands Q Hist Art 19(4):289–316
Gonthier N, Gousseau Y, Ladjal S (2021) An analysis of the transfer learning of convolutional neural networks for artistic images. In: Pattern recognition. ICPR international workshops and challenges: virtual event, January 10–15, 2021, Proceedings, Part III, pp 546–561. Springer
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852. https://doi.org/10.48550/ARXIV.1502.01852
Dong X, Bao J, Chen D, Zhang W, Yu N, Yuan L, Chen D, Guo B (2022) CSWin transformer: a general vision transformer backbone with cross-shaped windows. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12124–12134
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Schaerf, L., Postma, E. & Popovici, C. Art authentication with vision transformers. Neural Comput & Applic 36, 11849–11858 (2024). https://doi.org/10.1007/s00521-023-08864-8