Technology scaling allows manufacturers to integrate more cores into a single chip (i.e., a many-core processor), which delivers higher throughput by exploiting thread- and application-level parallelism. However, the chip's power and thermal constraints, which do not scale well with technology, have begun to limit the maximum throughput that many-core processors can deliver. Meanwhile, integrating more cores with each technology generation increases within-die (WID) core-to-core (C2C) frequency and power variations, reducing the performance and power efficiency of many-core processors. In this paper, we propose an optimization technique that maximizes the throughput of power- and thermal-constrained processors while accounting for WID C2C variations. The technique exploits the following three observations. First, WID C2C variations result in different power and frequency trade-offs between the fast and slow cores in a processor. Second, throughput is proportional to the average frequency of the cores when the processor allows each core to run at its own frequency (i.e., per-core clocking). Third, fast cores, which consume more power due to higher frequency and leakage, experience more thermal throttling than slow cores. Our experiments using a 32 nm technology demonstrate that the proposed optimization technique, which balances power consumption between the cores, is very effective for processors exhibiting large C2C frequency and power variances. The results show that the maximum throughput of 16-core processors with high C2C power variance can be improved by up to nearly 10%.
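To make the second observation concrete, the following is a minimal sketch, not the optimization technique proposed in this paper. It assumes a toy model in which each core has a hypothetical maximum frequency and a constant leakage power, dynamic power grows with the cube of frequency, and only a chip-level power budget applies (thermal constraints are ignored). All names and numbers (`CORES`, `LEVELS`, `K_DYN`, the 25 W budget) are illustrative assumptions. The sketch exhaustively searches per-core frequency levels for the assignment with the highest average frequency and compares it with the best assignment that forces every core to the same level.

```python
from itertools import product

# Hypothetical per-core parameters (illustrative only, not measured data).
# Faster cores are assumed to leak more, mirroring the C2C variation described above.
CORES = [
    {"f_max": 3.2, "leak": 3.0},  # GHz, W
    {"f_max": 3.0, "leak": 2.4},
    {"f_max": 2.8, "leak": 1.8},
    {"f_max": 2.6, "leak": 1.5},
]
LEVELS = [0.6, 0.7, 0.8, 0.9, 1.0]  # allowed fractions of each core's f_max
K_DYN = 0.25  # assumed dynamic-power coefficient in W/GHz^3


def core_power(core, level):
    """Leakage plus cubic dynamic power at the given frequency level."""
    f = core["f_max"] * level
    return core["leak"] + K_DYN * f ** 3


def total_power(assignment):
    """Chip power for one frequency level per core."""
    return sum(core_power(c, lvl) for c, lvl in zip(CORES, assignment))


def avg_freq(assignment):
    """Average core frequency, used here as the throughput proxy."""
    return sum(c["f_max"] * lvl for c, lvl in zip(CORES, assignment)) / len(CORES)


def best_per_core(budget):
    """Exhaustive search over independent per-core levels under the budget."""
    feasible = [a for a in product(LEVELS, repeat=len(CORES))
                if total_power(a) <= budget]
    return max(feasible, key=avg_freq) if feasible else None


def best_uniform(budget):
    """Best assignment when every core must use the same level."""
    candidates = [(lvl,) * len(CORES) for lvl in LEVELS]
    feasible = [a for a in candidates if total_power(a) <= budget]
    return max(feasible, key=avg_freq) if feasible else None


if __name__ == "__main__":
    budget = 25.0  # W, hypothetical chip-level power budget
    for name, pick in (("per-core", best_per_core), ("uniform", best_uniform)):
        a = pick(budget)
        print(name, a, round(avg_freq(a), 3), "GHz", round(total_power(a), 2), "W")
```

Because every uniform assignment is also a valid per-core assignment, the per-core search can never do worse than the uniform one under the same budget. Exhaustive search is used only because the example is tiny; it does not scale to 16 cores with many levels, and a realistic model would also make leakage depend on voltage and temperature.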