The emergence of High-Level Synthesis (HLS) techniques and tools, along with new features in high-end FPGAs such as multi-port memory interfaces, has enabled designers to utilize FPGAs not only for compute-bound but also for memory-bound tasks. This paper explains how to efficiently parallelise histogram, as a memory-bound task, utilizing the OpenCL framework running on FPGA. We have run our implementation on three high-end FPGAs including Alpha Data 7v3, Alpha Data ADM-PCIE-KU3 and Xilinx KU115. The 256 fixed-width bins histogram running on 7v3, KU3 and KU115 platforms shows 8.38, 15.29 and 38.57 Giga bin Update Per Second (GUPS), respectively. The best result, i.e., 38.57 GUPS on KU115 platform defeats the Nvidia GeForce 1060 GPU with 31.36 GUPS. In addition, it shows better performance than the one obtained in the dual socket 8-core Intel Xeon E5-2690 with 13 GUPS and 60-core Intel Xeon Phi 5110P coprocessor with 18 GUPS. The proposed implementation is not sensitive to locally invariant (LI) data sets, while the performance of GPU and CPU implementations drops with LI data. Processing locally invariant data sets shows that our FPGA implementation can be up to 91.4% and 44.9% faster than that of the GeForce 1060 and 1080 GPUs, respectively. The source codes of the designs are available at https://github.com/Hosseinabady/histogram_sdaccel.