Ebook: ECAI 2024
The field of AI has grown enormously since 1974, when a summer conference on Artificial Intelligence and Simulation of Behaviour was held in Brighton, UK. This milestone in the history of AI has since come to be thought of as the 1st European Conference on Artificial Intelligence (ECAI).
This book presents the proceedings of ECAI-2024, the 27th European Conference on Artificial Intelligence. This 50th-anniversary edition of the conference, with the theme of celebrating the past and inspiring the future, was held from 19 to 24 October 2024 in Santiago de Compostela, Spain. Included here are 544 Main Track papers (selected from 2,344 submissions, acceptance rate 23%), and 18 Demo Track papers (57 submissions, acceptance rate 32%). These cover all areas of AI, including planning, search, constraints, satisfiability, knowledge representation, uncertainty, data mining, machine learning, natural language processing, computer vision, multiagent systems, and robotics. Multi-disciplinary aspects such as the interplay between humans and machines, fairness, ethics, and trust are also covered.
Also included in the book are eight invited papers on Frontiers in AI and the abstracts of 14 recently published journal papers not previously presented at a conference with archival proceedings. ECAI-2024 was held concurrently with PAIS-2024, the 13th edition of the Conference on Prestigious Applications of Intelligent Systems, and the peer-reviewed papers accepted for presentation at PAIS (35 papers from 71 submissions, acceptance rate 49%) also form part of this book.
Offering a comprehensive overview of recent research and developments in AI, the book will be of interest to all those working in the field.
In July 1974, a “Summer Conference on Artificial Intelligence and Simulation of Behaviour” organised by the AISB was held in Brighton. It featured the presentation of 21 contributed papers and three invited talks.
Although the name and acronym were not adopted until 1982, this important milestone in the history of AI is now thought of as the 1st European Conference on Artificial Intelligence (ECAI). The convention of thinking of the 1974 conference as the first ECAI dates back to 1984, when what the organisers called the “6th” ECAI was held in Pisa. This means that in 2024, we are celebrating the 50th anniversary of ECAI.
This volume represents the proceedings of the 50th-anniversary edition of ECAI, the 27th European Conference on Artificial Intelligence, held in Santiago de Compostela from the 19th to the 24th of October 2024, under the motto of “celebrating the past, inspiring the future”. ECAI-2024 is organised by the European Association for Artificial Intelligence (EurAI) and the Spanish Association for Artificial Intelligence (AEPIA), and hosted by the Research Centre on Intelligent Technologies of the University of Santiago de Compostela (CiTIUS).
This is not the first time that plans have been made for ECAI to come to Santiago de Compostela. The 2020 edition was due to be hosted here, but had to be moved online owing to the Covid-19 pandemic. We are therefore looking forward all the more to welcoming the AI community of Europe and the world to Santiago, a city which has been famous for centuries as both a travel destination and a meeting place.
The field of AI has grown enormously since 1974. A total of 2,344 paper submissions were received for the main track of the conference (acceptance rate 23%), and a further 57 for the demo track (acceptance rate 32%). Submissions covered all areas of AI, including planning, search, constraints, satisfiability, knowledge representation, uncertainty, data mining, machine learning, natural language processing, computer vision, multiagent systems, and robotics. A significant number of submissions focused on multi- and interdisciplinary aspects, including the interplay between humans and machines, as well as matters relating to fairness, ethics, and trust.
Following a long tradition, this edition of ECAI will be held alongside the 13th edition of the Conference on Prestigious Applications of Intelligent Systems (PAIS), for which 71 submissions were received (acceptance rate 49%). PAIS provides an important forum for academic and industrial researchers and practitioners to share experiences and insights on the applicability, development, and deployment of intelligent systems.
This volume includes the peer-reviewed papers accepted for presentation at both ECAI and PAIS, as well as eight invited papers on Frontiers in AI. These invited papers correspond to short invited talks by members of the AI community currently doing particularly exciting and innovative work, the idea being to highlight important new results, techniques, and trends in our field. This volume also includes the abstracts of 14 papers recently published in one of the two leading discipline-wide journals in AI, namely Artificial Intelligence (AIJ) and the Journal of Artificial Intelligence Research (JAIR), but which have not previously been presented at a conference with archival proceedings.
We look forward to a rich scientific programme that is bound to stimulate fruitful discussion about recent progress, as well as the future of the field. Besides the presentation of the papers included in this volume, the programme will include a variety of panels and special sessions on topics of wider interest to the AI community. Plenary keynote addresses will be delivered by Iryna Gurevych, Marijn Heule, César A. Hidalgo, and Iolanda Leite. The PAIS invited talk will be given by Nirmalie Wiratunga, and the EurAI Dissertation Award talk by Andrea Campagner.
The main conference will be preceded by a similarly rich pre-conference programme, running throughout the weekend. This will include the doctoral consortium, providing PhD students in AI with an opportunity to meet and interact closely with their peers from across the globe, as well as a varied programme of satellite workshops and tutorials, a list of which is included in this volume.
Not only is ECAI one of the major scientific events in AI, it is also an event that aims to facilitate interaction between attendees, as well as introducing them to local culture, gastronomy, music, history, and architecture. This edition will be no exception, Santiago de Compostela being the capital of Galicia and home to one of the oldest universities in the world, founded in 1495.
Organising a large scientific conference such as ECAI is a community effort, and we owe a great debt of gratitude to many individuals and organisations. To start with, we would like to thank the members of the EurAI Board for their trust, and its president Carles Sierra for his frequent advice over the past year or so.
More than 2,800 members of the international AI community have contributed their time and expertise to the peer-review process which enabled us to select the very best work for presentation at the conference. You will find their names listed on the following pages. In addition, many senior members of the community have kindly provided their confidential advice on numerous occasions.
As chair of the demo track, Reyhan Aydoğan coordinated the selection of demo papers, while Gregor Behnke took responsibility for the journal track. Noa Agmon and Michela Milano put together an exciting programme of workshops, and Gabriele Röger and Margret Keuper organised the selection of a diverse range of tutorials. The all-important doctoral consortium is being chaired by Fernando P. Santos and Anaëlle Wilczynski, while as diversity-and-inclusion chairs, Isabel Neto and Samia Touileb have been working towards creating extra opportunities for networking, for starting new collaborations, and for exchanging ideas in an informal setting. As publicity chair, Luis Magdalena helped greatly in publicising ECAI and its many initiatives. Pedro Larrañaga organised the Lunch with a EurAI Fellow programme, whereby PhD students are able to interact closely with some of the most senior representatives of our field. Simon Rey and Diogo Carvalho have done an outstanding job as workflow managers, implementing a variety of tools which have allowed the work done by the programme committee to scale to the level required for a conference of the size of ECAI. Finally, we are most grateful to the many members of the local team in Santiago de Compostela for their crucial contributions to the organisation of the conference.
Organising ECAI was made possible by the exceptional support we received from the governments of the city of Santiago de Compostela, the province of A Coruña, Galicia, and Spain, and the sponsorship of a good number of companies and organisations who partnered with us for this event. We sincerely thank all of them for their support.
Last but not least, we must thank the countless members of the AI community who submitted their work to ECAI and PAIS and who will be joining us in Santiago de Compostela to celebrate the 50th anniversary of ECAI. We look forward to celebrating with you!
Ulle Endriss, Francisco S. Melo, Kerstin Bach, Alberto Bugarín-Diz, José M. Alonso-Moral, Senén Barro, and Fredrik Heintz
September 2024
The concept of General-Purpose AI (GPAI) has recently been permeating research papers, policy reports, and legal regulations as a way of referring to current and future models with high levels of capability and generality. Yet precisely characterising GPAI models remains elusive. Current definitions often describe GPAI models as those that ‘competently perform a wide range of distinct tasks’. To properly characterise GPAI, we need well-grounded definitions of capability and generality. In this paper, I will briefly introduce (or revisit) the concept of capability, going well beyond aggregate performance on benchmarks, and discuss practical procedures for evaluating the capability profile of AI systems and deriving generality metrics from it.
As technology widens our possibilities for communication and knowledge discovery, it can also leave us vulnerable to misinformation and to toxic, abusive, manipulative content, which can cause substantial harm both individually and collectively. Combating the malicious use of social media means solving very complex and subjective tasks that require a synergy between automated tools and humans. Existing ML models, trained on vast datasets, excel in many tasks, such as summarization, translation, and human-like content generation. However, we argue that for those tools to be effective, they need to form the types of representations and semantic inferences that humans use and create, to combine automatically acquired data and insights with the domain-specific expertise of their users, and to involve high-level reasoning: three features for which knowledge-based and symbolic approaches to artificial intelligence seem well suited.
The integration of business process management (BPM) with artificial intelligence (AI) is driving unprecedented advancements in the creation of trustworthy, intelligent information systems. On the one hand, BPM poses unconventional, relevant questions about processes and the event data produced during their execution. On the other hand, AI offers established techniques that must be adapted and further enhanced to address these questions effectively. This synergy is particularly impactful in the case of flexible processes, which are best represented using a declarative approach – emphasising the temporal constraints that must be respected by the process, instead of explicitly detailing all the acceptable flows of activities. In this article, we overview how automated reasoning and learning techniques for temporal logics on finite traces provide robust foundations for representing, mining, and synthesising declarative processes. We cover established results as well as frontier research in this area.
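To make the declarative style concrete, the following minimal sketch (an editorial illustration of the standard semantics, not code from the article) checks one classic DECLARE constraint, response(a, b), stating that every occurrence of a is eventually followed by b, over a finite trace:

```python
# Illustrative sketch: evaluating a DECLARE-style "response(a, b)"
# constraint -- "every occurrence of a is eventually followed by b" --
# over a finite trace, the setting of declarative process mining.

def satisfies_response(trace, a, b):
    """True iff every 'a' in the trace is later followed by a 'b'."""
    pending = False  # an 'a' still waiting for a matching 'b'
    for event in trace:
        if event == a:
            pending = True
        elif event == b:
            pending = False
    return not pending

# A trace is a finite sequence of activity labels.
assert satisfies_response(["register", "check", "pay"], "check", "pay")
assert not satisfies_response(["register", "check"], "check", "pay")
```

Mining then amounts to keeping the constraints that hold on (most of) the traces in an event log, while synthesis asks for traces, or strategies, that satisfy all constraints at once.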
We consider the exact and probably approximately correct (PAC) learning frameworks from computational learning theory and discuss opportunities and challenges for applying notions developed within these frameworks to extract information from black-box machine learning models, in particular, from language models. We discuss recent works that consider algorithms designed for the exact and PAC frameworks to extract information in the format of automata, Horn theories, and ontologies from machine learning models and possible applications of these approaches for understanding the models, studying biases, and knowledge acquisition.
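As a taste of how exact-learning algorithms are adapted to the PAC setting, here is a minimal sketch (our illustration, with names of our own choosing; the sample bound follows Angluin's classic analysis) of replacing an exact equivalence query with random sampling against the black-box model:

```python
# Illustrative sketch: a PAC-style equivalence check between a
# hypothesis and a black-box model. Membership queries label a single
# input; the PAC framework replaces exact equivalence queries by
# sampling: if the hypothesis agrees with the black box on enough
# random samples, it is probably approximately correct.

import math
import random

def pac_equivalent(black_box, hypothesis, sample_input, epsilon, delta, round_no=1):
    """Return a counterexample, or None if the hypothesis passes the
    PAC test: error <= epsilon with probability >= 1 - delta."""
    # Standard sample bound for the round_no-th equivalence query in
    # Angluin-style query learning.
    n = math.ceil((1 / epsilon) * (math.log(1 / delta) + round_no * math.log(2)))
    for _ in range(n):
        x = sample_input()            # draw from the target distribution
        if black_box(x) != hypothesis(x):
            return x                  # counterexample: refine and retry
    return None

# Example: 'model' stands in for any black-box classifier of bit-strings.
model = lambda s: s.count("1") % 2 == 0
guess = lambda s: True                          # trivial hypothesis
sampler = lambda: "".join(random.choice("01") for _ in range(8))
print(pac_equivalent(model, guess, sampler, epsilon=0.1, delta=0.05))
```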
In recent years, the computational social choice community has increasingly studied the topic of proportional representation. This topic is particularly relevant for political elections, with many countries basing their voting systems on this principle, especially in Europe. However, the ideas behind proportional representation are also relevant in many other domains, including applications in artificial intelligence. We discuss a model of sequential decision making with proportional rules, and how it can be used for three applications: for merging the outputs of several large language models, for improving reinforcement learning from human feedback (RLHF), and for virtual democracy.
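To illustrate the flavour of such rules, here is a minimal sketch of a weight-based sequential rule in the spirit of perpetual voting (our simplified illustration; the rules studied in the literature differ in their exact weight updates): agents who were recently satisfied lose influence, so a persistent minority wins a proportional share of the rounds over time:

```python
# Illustrative sketch: a sequential decision rule with a
# proportionality flavour. Satisfied agents' weights reset, ignored
# agents' weights grow, so minorities win rounds at their fair rate.

def sequential_proportional(approvals_per_round, alternatives):
    """approvals_per_round: one dict per round, agent -> set of
    approved alternatives. Returns the sequence of chosen alternatives."""
    agents = approvals_per_round[0].keys()
    weight = {i: 1.0 for i in agents}
    outcome = []
    for approvals in approvals_per_round:
        # Pick the alternative with the largest total supporter weight.
        winner = max(alternatives,
                     key=lambda a: sum(weight[i] for i in agents
                                       if a in approvals[i]))
        for i in agents:
            if winner in approvals[i]:
                weight[i] = 1.0      # satisfied: influence resets
            else:
                weight[i] += 1.0     # ignored: influence grows
        outcome.append(winner)
    return outcome

# A 2/3 vs 1/3 split of three agents over six rounds.
rounds = [{1: {"x"}, 2: {"x"}, 3: {"y"}} for _ in range(6)]
print(sequential_proportional(rounds, ["x", "y"]))  # x, x, y, x, x, y
```

With a two-thirds versus one-third split of the agents, the minority alternative is chosen in roughly one round out of three, which is what proportionality demands here.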
Most complex problems of social relevance, such as climate change mitigation, traffic management, taxation policy design, or infrastructure management, involve both multiple stakeholders and multiple potentially conflicting objectives. In a nutshell, the majority of real-world problems are multi-agent and multi-objective in nature. Artificial intelligence (AI) is a pivotal tool in designing solutions for such critical domains, which come with high impact and ramifications across many dimensions, from societal and economic well-being to the ethical, political, and legal levels. Given the current theoretical and algorithmic developments in AI, it is an opportune moment to take a holistic approach and design decision-support tools that: (i) tackle all the prominent challenges of such problems and consider both the multi-agent and multi-objective aspects; (ii) exhibit vital characteristics, such as explainability and transparency, in order to enhance user agency and alignment. These are the challenges that I will discuss during the Frontiers in AI session at ECAI 2024, together with a brief overview of my work and next steps for this field. This paper summarises my contribution to the session.
Currently, one of the predominant approaches to optimal classical planning is A* search with heuristics that partition action costs among several abstractions of the input planning task. One example of this approach is the Scorpion planner, which computes saturated cost partitionings over projections and Cartesian abstractions. Scorpion participated in the International Planning Competition 2023 and achieved second place in the optimal track, outperformed only by the Ragnarok portfolio planner, which includes Scorpion as a component. In this invited paper for the ECAI Frontiers in AI series, we present the components of Scorpion and analyze their contributions to the overall performance in an ablation study. As a result, the paper serves as a short introduction to many of the techniques that are vital for state-of-the-art performance in optimal classical planning.
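To give a flavour of the core technique, the following is a minimal sketch of saturated cost partitioning over explicit abstract transition systems (an editorial illustration of the textbook idea, simplified to non-negative costs; it is not Scorpion's code). Each abstraction keeps only the portion of each operator's cost that it actually needs to preserve its goal distances, and passes the remainder on to the next abstraction:

```python
# Illustrative sketch: saturated cost partitioning over explicit
# abstract transition systems, given as (states, transitions, goals)
# with transitions being (src, op, dst) triples.

import heapq

def goal_distances(states, transitions, goals, cost):
    """Dijkstra on the reversed abstract transition system."""
    dist = {s: float("inf") for s in states}
    pq = []
    for g in goals:
        dist[g] = 0.0
        heapq.heappush(pq, (0.0, g))
    while pq:
        d, s = heapq.heappop(pq)
        if d > dist[s]:
            continue
        for (src, op, dst) in transitions:
            if dst == s and d + cost[op] < dist[src]:
                dist[src] = d + cost[op]
                heapq.heappush(pq, (dist[src], src))
    return dist

def saturated_cost_partitioning(abstractions, cost):
    """abstractions: ordered list of (states, transitions, goals).
    Returns one goal-distance heuristic per abstraction."""
    remaining = dict(cost)
    heuristics = []
    for states, transitions, goals in abstractions:
        h = goal_distances(states, transitions, goals, remaining)
        heuristics.append(h)
        # Saturated cost of op: the least cost preserving all finite
        # goal distances, i.e. max over op-transitions of h(s) - h(t).
        for op in remaining:
            needed = max((h[s] - h[t] for (s, o, t) in transitions
                          if o == op and h[s] < float("inf")),
                         default=0.0)
            remaining[op] -= max(0.0, needed)
    return heuristics
```

Because every operator's cost is split across the abstractions in this way, the per-abstraction estimates can be summed, and the resulting heuristic remains admissible.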
Today, both science and industry rely heavily on machine learning models, predominantly artificial neural networks, that are becoming increasingly complex and demand ever more computing resources to train. In this paper, we take a holistic look at the efficiency of machine learning models and draw inspiration for addressing their main challenges from the principles of a green, sustainable economy. Instead of constraining the computations or memory available to the models, we focus on reusing what is already available to them: computations done in previous processing steps, partial information accessible at run-time, or knowledge gained by the model during previous training sessions in continually learned models. This new research path of zero-waste machine learning raises several research questions related to the efficiency of contemporary neural networks: how can machine learning models learn better with less data? How can they select the relevant data samples out of many? And how can they build on top of already trained models to reduce the need for more training samples? Here, we explore all of the above questions and attempt to answer them.
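As one concrete instance of reusing partial run-time information, the sketch below (our illustration, not a model from the paper) shows an early-exit network: the intermediate activations that are computed anyway also feed cheap auxiliary classifiers, so easy inputs can stop early and skip the remaining layers:

```python
# Illustrative sketch: an early-exit network. Confident intermediate
# predictions terminate inference early, reusing computation that
# would otherwise be only a stepping stone to the final layer.

import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, dim=64, n_classes=10, n_blocks=4, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
             for _ in range(n_blocks)])
        # One lightweight classifier head per block.
        self.heads = nn.ModuleList(
            [nn.Linear(dim, n_classes) for _ in range(n_blocks)])
        self.threshold = threshold

    def forward(self, x):
        # For simplicity this sketch assumes a batch of one input.
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits = head(x)
            conf = logits.softmax(-1).max().item()
            if conf >= self.threshold:   # confident: stop computing
                return logits
        return logits                     # fell through: full network

net = EarlyExitNet()
print(net(torch.randn(1, 64)).shape)      # torch.Size([1, 10])
```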
The journal track of the 27th European Conference on Artificial Intelligence (ECAI-2024) offered the authors of papers recently accepted for publication by one of the two leading discipline-wide journals in AI, Artificial Intelligence (AIJ) and the Journal of Artificial Intelligence Research (JAIR), the opportunity to present their work at the conference without undergoing an additional round of reviewing. Papers were eligible only if no part had previously been presented at a conference with archival proceedings. Traditionally, the authors of such papers would have missed out on the opportunity to present their work to a broader research audience, a limitation that tends to discourage the submission of original work to journals without prior conference publications on the same topic. The intention of the journal track is to encourage a “journal-first” publication strategy by giving authors the option to present their work at a suitable conference venue such as ECAI. On the following pages, for each paper presented at the journal track, we list bibliographic information and the abstract of the original publication.
Domain adaptation has been extensively explored in object detection. By utilizing self-training and decoupling adversarial feature learning from the training of the detector, current methods make detectors more transferable and ensure their discriminability. However, the presence of low-quality pseudo labels during self-training introduces noise into the training phase and thus degrades model performance. To tackle this challenge, we introduce the I-adapt framework, whose IoU Adapter accurately predicts the Intersection over Union (IoU) between predicted boxes and their corresponding ground-truth boxes in both source and target domains. This provides an effective measure of pseudo-label quality. Based on this measure, we propose a re-weighting strategy that encourages the detector to focus on learning from high-quality pseudo labels. We achieve state-of-the-art (SOTA) performance in several cross-domain object detection tasks, proving the effectiveness of I-adapt.
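As a rough illustration of the re-weighting idea (our sketch of one plausible reading, not the authors' code), a predicted IoU can serve as a per-box quality score that scales each pseudo-box's contribution to the detection loss:

```python
# Illustrative sketch: IoU as a pseudo-label quality score used to
# re-weight per-box losses during self-training.

import torch

def iou_xyxy(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def reweighted_loss(per_box_losses, predicted_ious):
    """Down-weight pseudo-boxes judged to be low quality."""
    weights = predicted_ious.detach()          # no gradient through weights
    return (weights * per_box_losses).sum() / weights.sum().clamp(min=1e-6)

# On labelled data the exact IoU is available as a training target
# for the quality predictor.
print(iou_xyxy([0, 0, 10, 10], [1, 1, 11, 11]))   # ~0.68

losses = torch.tensor([1.2, 0.4, 2.0])    # per pseudo-box losses
quality = torch.tensor([0.9, 0.8, 0.2])   # predicted IoUs
print(reweighted_loss(losses, quality))   # noisy third box counts less
```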
The performance of domain adaptation technologies has not yet reached an ideal level in the current 3D object detection field for autonomous driving, mainly owing to significant differences in the size of vehicles, as well as in the environments they operate in, when models are applied across domains. These factors together hinder the effective transfer and application of knowledge learned on specific datasets. Since the existing evaluation metrics were originally designed for evaluation on a single domain, calculating the 2D or 3D overlap between the predicted and ground-truth bounding boxes, they often suffer from the overfitting problem caused by the size differences among datasets. This raises a fundamental question about the evaluation of 3D object detection models’ cross-domain performance: Do we really need models to maintain excellent performance on their original 3D bounding boxes after being applied across domains? From a practical application perspective, one of our main concerns is actually preventing collisions between vehicles and other obstacles, especially in cross-domain scenarios where correctly predicting the size of vehicles is much more difficult. In other words, as long as a model can accurately identify the surfaces closest to the ego vehicle, that is sufficient to effectively avoid obstacles. In this paper, we propose two metrics to measure 3D object detection models’ ability to detect the surfaces closer to the sensor on the ego vehicle, which can be used to evaluate their cross-domain performance more comprehensively and reasonably. Furthermore, we propose a refinement head, named EdgeHead, to guide models to focus more on the learnable closer surfaces, which greatly improves the cross-domain performance of existing models not only under our new metrics, but also under the original BEV/3D metrics. Our code is available at https://github.com/Galaxy-ZRX/EdgeHead.
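To make the closer-surface idea concrete, the following sketch (our illustration; the paper's actual metrics will differ in detail) computes the distance from the ego sensor at the origin to the closest point on a BEV bounding box, the quantity that matters for collision avoidance more than the exact box extent:

```python
# Illustrative sketch: distance from the ego sensor (at the origin)
# to the closest point on the outline of a rotated BEV bounding box.

import math

def bev_corners(cx, cy, length, width, yaw):
    """Corners of a rotated BEV box given as (x, y, l, w, heading)."""
    c, s = math.cos(yaw), math.sin(yaw)
    half = [( length/2,  width/2), ( length/2, -width/2),
            (-length/2, -width/2), (-length/2,  width/2)]
    return [(cx + c*dx - s*dy, cy + s*dx + c*dy) for dx, dy in half]

def dist_point_segment(p, a, b):
    (px, py), (ax, ay), (bx, by) = p, a, b
    vx, vy = bx - ax, by - ay
    t = ((px - ax) * vx + (py - ay) * vy) / (vx * vx + vy * vy)
    t = max(0.0, min(1.0, t))
    qx, qy = ax + t * vx, ay + t * vy
    return math.hypot(px - qx, py - qy)

def closest_surface_distance(box):
    """Minimum distance from the origin to the box outline."""
    corners = bev_corners(*box)
    edges = zip(corners, corners[1:] + corners[:1])
    return min(dist_point_segment((0.0, 0.0), a, b) for a, b in edges)

gt   = (10.0, 0.0, 4.5, 1.9, 0.0)   # ground-truth box
pred = (10.2, 0.0, 5.2, 1.9, 0.0)   # longer box, similar near surface
print(closest_surface_distance(gt), closest_surface_distance(pred))
```

In the example, the predicted box is noticeably longer than the ground truth, yet its near surface is almost in the right place; an overlap-based metric would penalise this prediction far more than a closer-surface metric does.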
Flowcharts are graphical tools for presenting complex concepts in concise visual form. This paper introduces the FlowLearn dataset, a resource tailored to enhance the understanding of flowcharts. FlowLearn contains complex scientific flowcharts and simulated flowcharts. The scientific subset contains 3,858 flowcharts sourced from scientific literature, and the simulated subset contains 10,000 flowcharts created using a customizable script. The dataset is enriched with annotations for visual components, OCR, Mermaid code representation, and VQA question-answer pairs. Despite the proven capabilities of Large Vision-Language Models (LVLMs) in various visual understanding tasks, their effectiveness in decoding flowcharts, a crucial element of scientific communication, has yet to be thoroughly investigated. The FlowLearn test set is crafted to assess the performance of LVLMs in flowchart comprehension. Our study thoroughly evaluates state-of-the-art LVLMs, identifying existing limitations and establishing a foundation for future enhancements in this relatively underexplored domain. For instance, in tasks involving simulated flowcharts, GPT-4V achieved the highest accuracy (58%) in counting the number of nodes, while Claude recorded the highest accuracy (83%) in OCR tasks. Notably, no single model excels in all tasks within the FlowLearn framework, highlighting significant opportunities for further development.
Recent years have witnessed significant advances in image deraining due to the emergence of numerous effective Transformer and multi-layer perceptron (MLP) models. However, these models still rely on unidirectional information flow and fail to fully exploit the potentially useful information from multiple image scales, limiting their robustness in complex rainy scenes. To address this, we develop an effective closed-loop bidirectional scale-recurrent network (called CBS-Net) for image deraining, which organically integrates Transformer and MLP models to jointly explore multi-scale rain representations. Specifically, we introduce a sparse Transformer block within the intra-scale branch to adaptively capture the most useful content-aware features. Furthermore, we construct a dimensional MLP block within the inter-scale branch to dynamically modulate spatial-aware features from different scales. To ensure more accurate bidirectional estimations in our scale-recurrent network, a simple yet effective feedback propagation block is embedded to perform coarse-to-fine and fine-to-coarse information communication. Extensive experimental results show that our approach achieves state-of-the-art performance on multiple benchmark datasets, demonstrating its effectiveness and scalability.
Pre-trained large-scale models have exhibited remarkable efficacy in computer vision, particularly for 2D image analysis. However, when it comes to 3D point clouds, the constrained accessibility of data, in contrast to the vast repositories of images, poses a challenge for the development of 3D pre-trained models. This paper therefore attempts to directly leverage pre-trained models with 2D prior knowledge to accomplish the tasks for 3D point cloud analysis. Accordingly, we propose the Adaptive PointFormer (APF), which fine-tunes pre-trained 2D models with only a modest number of parameters to directly process point clouds, obviating the need for mapping to images. Specifically, we convert raw point clouds into point embeddings for aligning dimensions with image tokens. Given the inherent disorder in point clouds, in contrast to the structured nature of images, we then sequence the point embeddings to optimize the utilization of 2D attention priors. To calibrate attention across 3D and 2D domains and reduce computational overhead, a trainable PointFormer with a limited number of parameters is subsequently concatenated to a frozen pre-trained image model. Extensive experiments on various benchmarks demonstrate the effectiveness of the proposed APF. The source code and more details are available at https://vcc.tech/research/2024/PointFormer.
The Transformer model has gained widespread adoption in computer vision tasks in recent times. However, due to the time and memory complexity of self-attention, which grows quadratically with the number of input tokens, most existing Vision Transformers (ViTs) encounter challenges in achieving efficient performance in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although some recent attempts have been made to design CNN-Transformer hybrid architectures to tackle this problem, their overall performance has not met expectations. To address these challenges, we propose an efficient hybrid ViT architecture named FMViT. This approach enhances the model’s expressive power by blending high-frequency and low-frequency features at varying frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deploy-friendly mechanisms, including Multiple groups of MLP (gMLP) Reparameterization, Reparameterized Lightweight Multi-head Self-Attention (RLMHSA), and the Convolutional Fusion Block (CFB), to further improve the model’s performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the ImageNet dataset while maintaining similar inference latency. Moreover, FMViT achieves performance comparable to EfficientNet-B5, but with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% in top-1 accuracy on the ImageNet dataset (78.5% vs. 75.9%), with inference latency comparable to MobileOne.
Recent advancements in privacy-preserving deep learning (PPDL) enable artificial intelligence-assisted (AI-assisted) medical image diagnostics with privacy guarantees, addressing increasing concerns about data and model privacy. However, existing studies are restricted to shallow and narrow neural networks (NNs) for simple services (e.g., disease prediction), leaving a gap in exploring diverse inferences. This paper proposes TrustMIS, a trust-enhanced inference framework for fast and private medical image segmentation (MIS) and prediction services. Based on two-party computation, TrustMIS introduces lightweight additive secret-sharing tools to safeguard medical records and NNs. Complementing existing PPDL schemes, we present a series of secure two-party interactive protocols for linear layers. Specifically, we optimize secure matrix multiplication by reducing the number of expensive multiplication operations with the help of free addition operations, bringing 1.15× to 2.64× savings in both time and communication costs. Furthermore, we customize a fresh secure transposed-convolution protocol for MIS-oriented NNs. A thorough theoretical analysis is provided to prove TrustMIS’s correctness and security. We conduct experimental evaluations on two benchmark and four real-world medical datasets and compare TrustMIS to state-of-the-art studies. The results demonstrate TrustMIS’s superiority in efficiency and accuracy, with a 1.1× to 54.4× speedup in secure disease prediction and a 5.56% to 11.7% accuracy improvement in secure MIS.
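For readers unfamiliar with the underlying primitive, here is a toy two-party additive secret-sharing sketch with a Beaver-triple multiplication (the textbook construction, written by us for illustration; it is not the TrustMIS protocol, and a real deployment would produce triples in an offline phase rather than in the clear):

```python
# Illustrative sketch: two-party additive secret sharing over the
# ring Z_{2^32}, with multiplication via a Beaver triple. Both
# "parties" are simulated in one process; in reality each party
# holds only its own share.

import random

MOD = 2**32

def share(x):
    r = random.randrange(MOD)
    return r, (x - r) % MOD                    # party-0, party-1 shares

def reveal(s0, s1):
    return (s0 + s1) % MOD

def beaver_triple():
    # In practice produced by an offline phase or a trusted dealer.
    a, b = random.randrange(MOD), random.randrange(MOD)
    c = (a * b) % MOD
    return share(a), share(b), share(c)

def secure_mul(x_sh, y_sh):
    """Multiply secret-shared x and y using one Beaver triple."""
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    # Each party masks its shares locally; the masked values
    # e = x - a and f = y - b are opened, revealing nothing about x, y.
    e = reveal((x_sh[0] - a0) % MOD, (x_sh[1] - a1) % MOD)
    f = reveal((y_sh[0] - b0) % MOD, (y_sh[1] - b1) % MOD)
    z0 = (e * f + e * b0 + f * a0 + c0) % MOD  # party 0 adds e*f once
    z1 = (e * b1 + f * a1 + c1) % MOD          # party 1
    return z0, z1

x_sh, y_sh = share(7), share(6)
print(reveal(*secure_mul(x_sh, y_sh)))         # 42
```

Additions of shared values are free (each party adds its shares locally); only multiplications consume triples and communication, which is why reducing the number of multiplications, as the paper does, directly cuts both time and communication costs.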
Action recognition, an essential component of computer vision, plays a pivotal role in multiple applications. Despite significant improvements brought by Convolutional Neural Networks (CNNs), these models suffer performance declines when trained with discontinuous video frames, which is a frequent scenario in real-world settings. This decline primarily results from the loss of temporal continuity, which is crucial for understanding the semantics of human actions. To overcome this issue, we introduce the 4A (Action Animation-based Augmentation Approach) pipeline, which employs a series of sophisticated techniques: starting with 2D human pose estimation from RGB videos, followed by Quaternion-based Graph Convolution Network for joint orientation and trajectory prediction, and Dynamic Skeletal Interpolation for creating smoother, diversified actions using game engine technology. This innovative approach generates realistic animations in varied game environments, viewed from multiple viewpoints. In this way, our method effectively bridges the domain gap between virtual and real-world data. In experimental evaluations, the 4A pipeline achieves comparable or even superior performance to traditional training approaches using real-world data, while requiring only 10% of the original data volume. Additionally, our approach demonstrates enhanced performance on In-the-wild videos, marking a significant advancement in the field of action recognition. The full version of this paper, along with the code and data, can be found at [41].
3D human pose estimation is a vital task in computer vision, involving the prediction of human joint positions from images or videos to reconstruct a skeleton of a human in three-dimensional space. This technology is pivotal in various fields, including animation, security, human-computer interaction, and automotive safety, where it promotes both technological progress and enhanced human well-being. The advent of deep learning has significantly advanced the performance of 3D pose estimation by incorporating temporal information when predicting the spatial positions of human joints. However, traditional methods often fall short, as they primarily focus on the spatial coordinates of joints and overlook the orientation and rotation of the connecting bones, which are crucial for a comprehensive understanding of human pose in 3D space. To address these limitations, we introduce Quater-GCN (Q-GCN), a directed graph convolutional network tailored to enhance pose estimation with orientation information. Q-GCN excels by not only capturing the spatial dependencies among joints through their coordinates but also integrating the dynamic context of bone rotations in 2D space. This approach enables a more sophisticated representation of human poses by also regressing the orientation of each bone in 3D space, moving beyond mere coordinate prediction. Furthermore, we complement our model with a semi-supervised training strategy that leverages unlabeled data, addressing the challenge of limited orientation ground-truth data. Through comprehensive evaluations, Q-GCN has demonstrated outstanding performance against current state-of-the-art methods. The full version of this paper, along with the code and data, can be found at [39].
As a novel 3D scene representation, semantic occupancy has gained much attention in autonomous driving. However, existing occupancy prediction methods mainly focus on designing better occupancy representations, such as tri-perspective views or neural radiance fields, while ignoring the advantages of using long-temporal information. In this paper, we propose a radar-camera multi-modal temporal enhanced occupancy prediction network, dubbed TEOcc. Our method is inspired by the success of utilizing temporal information in 3D object detection. Specifically, we introduce a temporal enhancement branch to learn temporal occupancy prediction. In this branch, we randomly discard the t−k input frame of the multi-view camera and predict its 3D occupancy by long-term and short-term temporal decoders separately, using the information from other adjacent frames and multi-modal inputs. Besides, to reduce computational costs and incorporate multi-modal inputs, we specially design 3D convolutional layers for the long-term and short-term temporal decoders. Furthermore, since the lightweight occupancy prediction head is a dense classification head, we propose to use a shared occupancy prediction head for the temporal enhancement and main branches. It is worth noting that the temporal enhancement branch is only used during training and is discarded during inference. Experimental results demonstrate that TEOcc achieves state-of-the-art occupancy prediction on nuScenes benchmarks. In addition, the proposed temporal enhancement branch is a plug-and-play module that can be easily integrated into existing occupancy prediction methods to improve the performance of occupancy prediction. The source code and models will be released at https://github.com/VDIGPKU/TEOcc.
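The train-only auxiliary branch with a shared head is a reusable pattern; the following PyTorch sketch (our schematic illustration with made-up module names and sizes, not the TEOcc architecture) shows its skeleton: the auxiliary branch contributes an extra prediction through the shared head during training and is simply skipped at inference:

```python
# Illustrative sketch: an auxiliary branch that is active during
# training and discarded at inference, with a prediction head shared
# between the main and auxiliary branches.

import torch
import torch.nn as nn

class OccNet(nn.Module):
    def __init__(self, dim=32, n_classes=17):
        super().__init__()
        self.main_branch = nn.Linear(dim, dim)
        self.aux_branch = nn.Linear(dim, dim)        # training only
        self.shared_head = nn.Linear(dim, n_classes)  # shared classifier

    def forward(self, feats, dropped_frame_feats=None):
        out = self.shared_head(self.main_branch(feats))
        if self.training and dropped_frame_feats is not None:
            # Auxiliary prediction for the withheld frame, routed
            # through the same shared head as the main branch.
            aux = self.shared_head(self.aux_branch(dropped_frame_feats))
            return out, aux
        return out   # inference: the auxiliary branch never runs

net = OccNet().eval()
print(net(torch.randn(2, 32)).shape)   # torch.Size([2, 17])
```

During training, both outputs are supervised through the same head; at inference the conditional branch never executes, so the extra supervision comes at no deployment cost.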
Lane detection is an important yet challenging task in autonomous driving systems. Building on the development of Vision Transformers, early Transformer-based lane detection studies have achieved promising results in some scenarios. However, for complex road conditions such as uneven illumination and heavy traffic, the performance of these methods remains limited and may even be worse than that of contemporaneous CNN-based methods. In this paper, we propose a novel Transformer-based end-to-end network, called SinLane, that learns attention weights focused on sparse yet meaningful locations and improves the accuracy of lane detection in complex environments. SinLane is composed of a novel Siamese Visual Transformer structure and a novel Feature Pyramid Network (FPN) structure called Pyramid Feature Integration (PFI). We utilize the proposed PFI to better integrate global semantics and finer-scale features and to promote the optimization of the Transformer. Moreover, the designed Siamese Visual Transformer is combined with multiple levels of the PFI and is employed to refine the multi-scale lane line features output by the PFI. Extensive experiments on three lane detection benchmark datasets demonstrate that our SinLane achieves state-of-the-art results with high accuracy and efficiency. Specifically, our SinLane improves accuracy by over 3% compared to the current best-performing Transformer-based method for lane detection on CULane.