[DISCUSS] FLIP-18: Code Generation for improving sorting performance

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[DISCUSS] FLIP-18: Code Generation for improving sorting performance

Pattarawat Chormai
Hi all,

I would like to initiate a discussion of applying code generation to NormalizedKeySorter. The goal is to improve sorting performance by generating suitable NormalizedKeySorter for underlying data. This generated sorter will contains only necessary code in important methods, such as swap and compare, hence improving sorting performance. 

Details of the implementation is illustrated at FLIP-18 : Code Generation for improving sorting performance.


Also, because we’re doing it as a course project at TUB, we have completed the implementation and made a pull-request to Flink repo already.

From our evaluation, we have found that the pull-request reduces sorting time around 7-10% and together with FLINK-3722 the sorting time is decreased by 12-20%.



Please take a look at the document and the pull-request and let me know if you have any suggestion.

Best,
Pat
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-18: Code Generation for improving sorting performance

Fabian Hueske-2
Hi Pat,

thanks a lot for this great proposal! I think it is very well structured
and has the right level of detail.
The improvements of your performance benchmarks look very promising and I
think code-gen'd sorters would be a very nice improvement.
I like that you plan to add a switch to activate this feature.

In order move on, we will need a committer who "champions" your FLIP,
reviews the pull request, and eventually merges it.

@Greg and @Stephan, what do you think about this proposal?

Best, Fabian


2017-03-14 16:10 GMT+01:00 Pattarawat Chormai <[hidden email]>:

> Hi all,
>
> I would like to initiate a discussion of applying code generation to
> NormalizedKeySorter. The goal is to improve sorting performance
> by generating suitable NormalizedKeySorter for underlying data. This
> generated sorter will contains only necessary code in important methods,
> such as swap and compare, hence improving sorting performance.
>
> Details of the implementation is illustrated at FLIP-18 : Code Generation
> for improving sorting performance.
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-18%3A+Code+Generation+for+improving+sorting+performance>
>
>
> Also, because we’re doing it as a course project at TUB, we have
> completed the implementation and made a pull-request
> <https://github.com/apache/flink/pull/3511> to Flink repo already.
>
> From our evaluation, we have found that the pull-request reduces sorting
> time around 7-10% and together with FLINK-3722
> <https://issues.apache.org/jira/browse/FLINK-3722> the sorting time is
> decreased by 12-20%.
>
>
>
> Please take a look at the document and the pull-request and let me know if
> you have any suggestion.
>
> Best,
> Pat
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-18: Code Generation for improving sorting performance

Greg Hogan
I would be more than happy to shepherd and review this PR.

I have two discussion points. First, a strategy for developing with templates. IntelliJ has a FreeMarker plugin but we lose formatting and code completion. To minimize this issue we can retain the untemplated code in an abstract class which is then concretely subclassed by the template.

Second, additional classes will turn performance critical callsites megamorphic. Stephan noted this issue in his work on MemorySegment.
  http://flink.apache.org/news/2015/09/16/off-heap-memory.html

For example, QuickSort calls IndexedSortable#compare and IndexedSortable#swap. With multiple compiled implementations of the sorter template these callsites can no longer be inlined (the same is true with NormalizedKeySorter and FixedLengthRecordSorter if the latter was instrumented).

I have not found a way to duplicate a Java class at runtime, but we may be able to use Janino to compile a class which is then uniquely renamed: each IndexSortable type would map to a different QuickSort type (same bytecode, but uniquely optimized). This should also boost performance of runtime operators calling user defined functions.

Given the code already written, I expect we can refactor, review, and benchmark for the 1.3 release.

Greg


> On Mar 21, 2017, at 3:46 PM, Fabian Hueske <[hidden email]> wrote:
>
> Hi Pat,
>
> thanks a lot for this great proposal! I think it is very well structured and has the right level of detail.
> The improvements of your performance benchmarks look very promising and I think code-gen'd sorters would be a very nice improvement.
> I like that you plan to add a switch to activate this feature.
>
> In order move on, we will need a committer who "champions" your FLIP, reviews the pull request, and eventually merges it.
>
> @Greg and @Stephan, what do you think about this proposal?
>
> Best, Fabian
>
>
> 2017-03-14 16:10 GMT+01:00 Pattarawat Chormai <[hidden email] <mailto:[hidden email]>>:
> Hi all,
>
> I would like to initiate a discussion of applying code generation to NormalizedKeySorter. The goal is to improve sorting performance by generating suitable NormalizedKeySorter for underlying data. This generated sorter will contains only necessary code in important methods, such as swap and compare, hence improving sorting performance.
>
> Details of the implementation is illustrated at FLIP-18 : Code Generation for improving sorting performance. <https://cwiki.apache.org/confluence/display/FLINK/FLIP-18%3A+Code+Generation+for+improving+sorting+performance>
>
>
> Also, because we’re doing it as a course project at TUB, we have completed the implementation and made a pull-request <https://github.com/apache/flink/pull/3511> to Flink repo already.
>
> From our evaluation, we have found that the pull-request reduces sorting time around 7-10% and together with FLINK-3722 <https://issues.apache.org/jira/browse/FLINK-3722> the sorting time is decreased by 12-20%.
>
>
>
> Please take a look at the document and the pull-request and let me know if you have any suggestion.
>
> Best,
> Pat
>

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-18: Code Generation for improving sorting performance

Gábor Gévay
Hello,

> Second, additional classes will turn performance critical callsites megamorphic.

Yes, this is a completely valid point, thanks for raising this issue
Greg. We were planning to have an offline discussion tomorrow with
Pattarawat about this. We have a few options:
1. We could fuse the QuickSort and NormalizedKeySorter into a single
class when generating the code.
2. As you said, duplicating the QuickSort class might also be a nice
solution, if we can find a simple way to pull it off.
3. We might do some benchmarks and determine that the effect of this
issue is negligible and therefore we can ignore it. But I'm worried
that this is not the case, so we'll probably have to go with one of
the above options.

Best,
Gábor



On Thu, Mar 23, 2017 at 6:12 PM, Greg Hogan <[hidden email]> wrote:

> I would be more than happy to shepherd and review this PR.
>
> I have two discussion points. First, a strategy for developing with templates. IntelliJ has a FreeMarker plugin but we lose formatting and code completion. To minimize this issue we can retain the untemplated code in an abstract class which is then concretely subclassed by the template.
>
> Second, additional classes will turn performance critical callsites megamorphic. Stephan noted this issue in his work on MemorySegment.
>   http://flink.apache.org/news/2015/09/16/off-heap-memory.html
>
> For example, QuickSort calls IndexedSortable#compare and IndexedSortable#swap. With multiple compiled implementations of the sorter template these callsites can no longer be inlined (the same is true with NormalizedKeySorter and FixedLengthRecordSorter if the latter was instrumented).
>
> I have not found a way to duplicate a Java class at runtime, but we may be able to use Janino to compile a class which is then uniquely renamed: each IndexSortable type would map to a different QuickSort type (same bytecode, but uniquely optimized). This should also boost performance of runtime operators calling user defined functions.
>
> Given the code already written, I expect we can refactor, review, and benchmark for the 1.3 release.
>
> Greg
>
>
>> On Mar 21, 2017, at 3:46 PM, Fabian Hueske <[hidden email]> wrote:
>>
>> Hi Pat,
>>
>> thanks a lot for this great proposal! I think it is very well structured and has the right level of detail.
>> The improvements of your performance benchmarks look very promising and I think code-gen'd sorters would be a very nice improvement.
>> I like that you plan to add a switch to activate this feature.
>>
>> In order move on, we will need a committer who "champions" your FLIP, reviews the pull request, and eventually merges it.
>>
>> @Greg and @Stephan, what do you think about this proposal?
>>
>> Best, Fabian
>>
>>
>> 2017-03-14 16:10 GMT+01:00 Pattarawat Chormai <[hidden email] <mailto:[hidden email]>>:
>> Hi all,
>>
>> I would like to initiate a discussion of applying code generation to NormalizedKeySorter. The goal is to improve sorting performance by generating suitable NormalizedKeySorter for underlying data. This generated sorter will contains only necessary code in important methods, such as swap and compare, hence improving sorting performance.
>>
>> Details of the implementation is illustrated at FLIP-18 : Code Generation for improving sorting performance. <https://cwiki.apache.org/confluence/display/FLINK/FLIP-18%3A+Code+Generation+for+improving+sorting+performance>
>>
>>
>> Also, because we’re doing it as a course project at TUB, we have completed the implementation and made a pull-request <https://github.com/apache/flink/pull/3511> to Flink repo already.
>>
>> From our evaluation, we have found that the pull-request reduces sorting time around 7-10% and together with FLINK-3722 <https://issues.apache.org/jira/browse/FLINK-3722> the sorting time is decreased by 12-20%.
>>
>>
>>
>> Please take a look at the document and the pull-request and let me know if you have any suggestion.
>>
>> Best,
>> Pat
>>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-18: Code Generation for improving sorting performance

Pattarawat Chormai
Hi,

Gábor and I plan to investigate the metamorphic call issue further. We will implement the idea of combining QuickSort and NormalizedKeySorter together in generated code and benchmark the improvement with a Flink’s job that uses 3 different sorters.

Best,
Pat


> On Mar 23, 2017, at 6:31 PM, Gábor Gévay <[hidden email]> wrote:
>
> Hello,
>
>> Second, additional classes will turn performance critical callsites megamorphic.
>
> Yes, this is a completely valid point, thanks for raising this issue
> Greg. We were planning to have an offline discussion tomorrow with
> Pattarawat about this. We have a few options:
> 1. We could fuse the QuickSort and NormalizedKeySorter into a single
> class when generating the code.
> 2. As you said, duplicating the QuickSort class might also be a nice
> solution, if we can find a simple way to pull it off.
> 3. We might do some benchmarks and determine that the effect of this
> issue is negligible and therefore we can ignore it. But I'm worried
> that this is not the case, so we'll probably have to go with one of
> the above options.
>
> Best,
> Gábor
>
>
>
> On Thu, Mar 23, 2017 at 6:12 PM, Greg Hogan <[hidden email]> wrote:
>> I would be more than happy to shepherd and review this PR.
>>
>> I have two discussion points. First, a strategy for developing with templates. IntelliJ has a FreeMarker plugin but we lose formatting and code completion. To minimize this issue we can retain the untemplated code in an abstract class which is then concretely subclassed by the template.
>>
>> Second, additional classes will turn performance critical callsites megamorphic. Stephan noted this issue in his work on MemorySegment.
>>  http://flink.apache.org/news/2015/09/16/off-heap-memory.html
>>
>> For example, QuickSort calls IndexedSortable#compare and IndexedSortable#swap. With multiple compiled implementations of the sorter template these callsites can no longer be inlined (the same is true with NormalizedKeySorter and FixedLengthRecordSorter if the latter was instrumented).
>>
>> I have not found a way to duplicate a Java class at runtime, but we may be able to use Janino to compile a class which is then uniquely renamed: each IndexSortable type would map to a different QuickSort type (same bytecode, but uniquely optimized). This should also boost performance of runtime operators calling user defined functions.
>>
>> Given the code already written, I expect we can refactor, review, and benchmark for the 1.3 release.
>>
>> Greg
>>
>>
>>> On Mar 21, 2017, at 3:46 PM, Fabian Hueske <[hidden email]> wrote:
>>>
>>> Hi Pat,
>>>
>>> thanks a lot for this great proposal! I think it is very well structured and has the right level of detail.
>>> The improvements of your performance benchmarks look very promising and I think code-gen'd sorters would be a very nice improvement.
>>> I like that you plan to add a switch to activate this feature.
>>>
>>> In order move on, we will need a committer who "champions" your FLIP, reviews the pull request, and eventually merges it.
>>>
>>> @Greg and @Stephan, what do you think about this proposal?
>>>
>>> Best, Fabian
>>>
>>>
>>> 2017-03-14 16:10 GMT+01:00 Pattarawat Chormai <[hidden email] <mailto:[hidden email]>>:
>>> Hi all,
>>>
>>> I would like to initiate a discussion of applying code generation to NormalizedKeySorter. The goal is to improve sorting performance by generating suitable NormalizedKeySorter for underlying data. This generated sorter will contains only necessary code in important methods, such as swap and compare, hence improving sorting performance.
>>>
>>> Details of the implementation is illustrated at FLIP-18 : Code Generation for improving sorting performance. <https://cwiki.apache.org/confluence/display/FLINK/FLIP-18%3A+Code+Generation+for+improving+sorting+performance>
>>>
>>>
>>> Also, because we’re doing it as a course project at TUB, we have completed the implementation and made a pull-request <https://github.com/apache/flink/pull/3511> to Flink repo already.
>>>
>>> From our evaluation, we have found that the pull-request reduces sorting time around 7-10% and together with FLINK-3722 <https://issues.apache.org/jira/browse/FLINK-3722> the sorting time is decreased by 12-20%.
>>>
>>>
>>>
>>> Please take a look at the document and the pull-request and let me know if you have any suggestion.
>>>
>>> Best,
>>> Pat
>>>
>>

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-18: Code Generation for improving sorting performance

Pattarawat Chormai
In reply to this post by Pattarawat Chormai
Hi guys,

I have made an additional optimization[1] related to megamorphic call issue that Greg mentioned earlier. The optimization[2] improves execution time around ~13%, while the original code from FLINK-5734 is ~11%.

IMHO, the improvement from metamorphic call optimization is very small compared to the code we have to introduce. So, I think we can just go with the PR that we currently have. What do you think?

[1] https://github.com/heytitle/flink/commit/8e38b4d738b9953337361c62a8d77e909327d28f
[2]https://docs.google.com/spreadsheets/d/1PcdCdFX4bGecO6Lb5dLI2nww2NoeaA8c3MgbEdsVmk0/edit#gid=598217386

Best,
Pat
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-18: Code Generation for improving sorting performance

Greg Hogan
Pat,

Thanks for running additional tests and continuing to work on this contribution.

My testing is also showing that the performance gains remain even when multiple classes are used for sorting.

I think we should proceed in the order of FLINK-3722, FLINK-4705, and FLINK-5734. Gabor has reviewed FLINK-3722 and I’ve done so multiple times. I’m looking into test coverage for FLINK-4705. Once these are reviewed and FLINK-5734 rebased we can benchmark Flink’s performance to validate the improvements.

Greg


> On Apr 3, 2017, at 8:46 PM, Pattarawat Chormai <[hidden email]> wrote:
>
> Hi guys,
>
> I have made an additional optimization[1] related to megamorphic call issue
> that Greg mentioned earlier. The optimization[2] improves execution time
> around ~13%, while the original code from FLINK-5734 is ~11%.
>
> IMHO, the improvement from metamorphic call optimization is very small
> compared to the code we have to introduce. So, I think we can just go with
> the PR that we currently have. What do you think?
>
> [1]
> https://github.com/heytitle/flink/commit/8e38b4d738b9953337361c62a8d77e909327d28f
> [2]https://docs.google.com/spreadsheets/d/1PcdCdFX4bGecO6Lb5dLI2nww2NoeaA8c3MgbEdsVmk0/edit#gid=598217386
>
> Best,
> Pat
>
>
>
> --
> View this message in context: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-18-Code-Generation-for-improving-sorting-performance-tp16486p16923.html
> Sent from the Apache Flink Mailing List archive. mailing list archive at Nabble.com.

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-18: Code Generation for improving sorting performance

Pattarawat Chormai
Hi Greg,

Thanks for the input.  Please let me know If you need some help. Meanwhile, I will look into code of the tasks that you mentioned.

Best,
Pat

> On Apr 5, 2017, at 5:42 PM, Greg Hogan <[hidden email]> wrote:
>
> Pat,
>
> Thanks for running additional tests and continuing to work on this contribution.
>
> My testing is also showing that the performance gains remain even when multiple classes are used for sorting.
>
> I think we should proceed in the order of FLINK-3722, FLINK-4705, and FLINK-5734. Gabor has reviewed FLINK-3722 and I’ve done so multiple times. I’m looking into test coverage for FLINK-4705. Once these are reviewed and FLINK-5734 rebased we can benchmark Flink’s performance to validate the improvements.
>
> Greg
>
>
>> On Apr 3, 2017, at 8:46 PM, Pattarawat Chormai <[hidden email]> wrote:
>>
>> Hi guys,
>>
>> I have made an additional optimization[1] related to megamorphic call issue
>> that Greg mentioned earlier. The optimization[2] improves execution time
>> around ~13%, while the original code from FLINK-5734 is ~11%.
>>
>> IMHO, the improvement from metamorphic call optimization is very small
>> compared to the code we have to introduce. So, I think we can just go with
>> the PR that we currently have. What do you think?
>>
>> [1]
>> https://github.com/heytitle/flink/commit/8e38b4d738b9953337361c62a8d77e909327d28f
>> [2]https://docs.google.com/spreadsheets/d/1PcdCdFX4bGecO6Lb5dLI2nww2NoeaA8c3MgbEdsVmk0/edit#gid=598217386
>>
>> Best,
>> Pat
>>
>>
>>
>> --
>> View this message in context: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-18-Code-Generation-for-improving-sorting-performance-tp16486p16923.html
>> Sent from the Apache Flink Mailing List archive. mailing list archive at Nabble.com.
>