

Federated learning (FL) has great potential for large-scale machine learning applications, as it trains a global model over distributed client data. However, FL deployed in real-world applications often incurs collaboration bias and unstable convergence with inconsistent local predictions, resulting in poor model performance on heterogeneous and long-tailed client data distributions. In this paper, we reconsider heterogeneous FL in a two-stage learning paradigm in which representation learning and classifier re-training are separated so that each stage can incorporate a different sampling scheme. This separation lets us resolve the dilemma between learning more generalizable features and fine-tuning a biased classifier built on aggregated client models. Specifically, we propose a novel hybrid knowledge distillation scheme, called FedHyb, to facilitate the two-stage learning. From the perspective of knowledge transfer, we show that FedHyb induces several desirable properties in the global feature space and in the fine-tuning optimization, thus achieving better test accuracy and faster convergence, especially under higher data heterogeneity and a growing number of distributed clients. FedHyb requires no information exchange between clients, preventing privacy leakage, and is more robust under poisoning attacks compared with other FL methods designed for heterogeneous data.
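To make the two-stage paradigm referenced above concrete, the following is a minimal sketch of decoupled federated training: Stage 1 learns representations with standard instance-balanced sampling and FedAvg-style aggregation, and Stage 2 re-trains only the classifier on frozen features with class-balanced sampling. All names (SmallNet, fedavg, local_train, the synthetic client data) are illustrative assumptions; this is not the FedHyb hybrid distillation scheme itself, whose loss terms are not specified in this abstract.

```python
# Illustrative sketch of two-stage (decoupled) heterogeneous FL; not the authors' FedHyb.
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

class SmallNet(nn.Module):
    """Feature extractor + linear classifier, so the two stages can be decoupled."""
    def __init__(self, in_dim=20, feat_dim=32, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)
    def forward(self, x):
        return self.classifier(self.features(x))

def fedavg(states):
    """Uniform parameter averaging of client state dicts (FedAvg-style aggregation)."""
    avg = copy.deepcopy(states[0])
    for k in avg:
        avg[k] = torch.stack([s[k].float() for s in states]).mean(0)
    return avg

def local_train(model, loader, epochs=1, lr=0.05):
    """A few local SGD epochs on one client's data; returns the updated parameters."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def make_client_data(seed, num_classes=5, n=200, in_dim=20):
    """Synthetic heterogeneous client data: each client sees a skewed label mix."""
    g = torch.Generator().manual_seed(seed)
    y = torch.randint(0, max(2, num_classes - seed), (n,), generator=g)
    x = torch.randn(n, in_dim, generator=g) + y.unsqueeze(1).float()
    return TensorDataset(x, y)

clients = [make_client_data(s) for s in range(3)]
global_model = SmallNet()

# Stage 1: federated representation learning with instance-balanced sampling.
for rnd in range(5):
    states = []
    for ds in clients:
        local = copy.deepcopy(global_model)
        loader = DataLoader(ds, batch_size=32, shuffle=True)  # instance-balanced
        states.append(local_train(local, loader))
    global_model.load_state_dict(fedavg(states))

# Stage 2: classifier re-training on frozen features with class-balanced sampling,
# still federated (no raw data leaves any client).
for p in global_model.features.parameters():
    p.requires_grad = False
for rnd in range(3):
    states = []
    for ds in clients:
        local = copy.deepcopy(global_model)
        y = ds.tensors[1]
        w = (1.0 / torch.bincount(y).float().clamp(min=1))[y]  # rebalance skewed labels
        sampler = WeightedRandomSampler(w, num_samples=len(y), replacement=True)
        loader = DataLoader(ds, batch_size=32, sampler=sampler)  # class-balanced
        states.append(local_train(local, loader, lr=0.01))
    global_model.load_state_dict(fedavg(states))
```

In this sketch, only model parameters travel between clients and server; FedHyb additionally replaces the plain local objective with its hybrid knowledge distillation losses, which are described in the body of the paper.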