FLINK-5734 : Code Generation for NormalizedKeySorter

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

FLINK-5734 : Code Generation for NormalizedKeySorter

Pattarawat Chormai
Hi all,

We’re working on FLINK-5734 : Implement code generation for NormalizedKeySorter. We have just finished a design document that describes how we’re going to proceed this implementation.

Because this implement will touch several important parts of Flink, so we would appreciate if you could give us some feedback before starting implementation.
Here is the link to the document : https://docs.google.com/document/d/1anGQhBn9qI0yqe7twVvrDIiym4U4gxalJkZzM4Ar4QM/edit?usp=sharing <https://docs.google.com/document/d/1anGQhBn9qI0yqe7twVvrDIiym4U4gxalJkZzM4Ar4QM/edit?usp=sharing>

Also, we would like to confirm that NormalizedKeySorter is always instantiated at TaskManager, but we could not find any solution to verify that yet. If you know how to do it, we would be very pleased to hear that.

Thank you in advance for your feedback.

Best regards,
Serkan, Gábor and Pat
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Till Rohrmann
Hi Pat, Serkan and Gabor,

I really like your design document and the preliminary results look really
good. Impressive! I think your document can be converted into a FLIP.

I was wondering whether it makes sense (or not) to generate the code on the
client side? Then we would not have to introduce the `TemplateManager`,
which avoids increasing the complexity of the TM, and we wouldn't need to
ship TypeInformation to the TMs (I know this already happens but initially
the idea was that TypeInformation is only used in the pre-flight phase).
Why did you decide to generate the code on the TMs?

Cheers,
Till





On Tue, Feb 7, 2017 at 10:34 PM, Pattarawat Chormai <[hidden email]>
wrote:

> Hi all,
>
> We’re working on FLINK-5734 : Implement code generation for
> NormalizedKeySorter. We have just finished a design document that describes
> how we’re going to proceed this implementation.
>
> Because this implement will touch several important parts of Flink, so we
> would appreciate if you could give us some feedback before starting
> implementation.
> Here is the link to the document : https://docs.google.com/document/d/
> 1anGQhBn9qI0yqe7twVvrDIiym4U4gxalJkZzM4Ar4QM/edit?usp=sharing <
> https://docs.google.com/document/d/1anGQhBn9qI0yqe7twVvrDIiym4U4g
> xalJkZzM4Ar4QM/edit?usp=sharing>
>
> Also, we would like to confirm that NormalizedKeySorter is always
> instantiated at TaskManager, but we could not find any solution to verify
> that yet. If you know how to do it, we would be very pleased to hear that.
>
> Thank you in advance for your feedback.
>
> Best regards,
> Serkan, Gábor and Pat
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Gábor Gévay
Hello Till,

> Why did you decide to generate the code on the TMs?

If we generated on the client side, then we would need to serialize
instances of the generated classes when shipping the job to the TMs,
but we would really like to avoid serializing instances of the
generated classes.
In the other code generation effort [1] (which is for the serializers
and comparators), we tried to do the generation on the client side,
because the serializers and comparators get instantiated on the client
side. However, it turned out that serializing and shipping instances
of a generated class to an other machine is a really big mess: When
you want to deserialize an object on an other machine then you need to
be able to load the class of that object on that machine. However, the
classloaders on that machine don't know about the class that was
generated in the client machine. (We ended up including the source
code of the generated class into the serialization as a string, and
essentially redoing the generation in readObject.)

Best,
Gábor

[1] https://github.com/apache/flink/pull/2211






2017-02-08 12:01 GMT+01:00 Till Rohrmann <[hidden email]>:

> Hi Pat, Serkan and Gabor,
>
> I really like your design document and the preliminary results look really
> good. Impressive! I think your document can be converted into a FLIP.
>
> I was wondering whether it makes sense (or not) to generate the code on the
> client side? Then we would not have to introduce the `TemplateManager`,
> which avoids increasing the complexity of the TM, and we wouldn't need to
> ship TypeInformation to the TMs (I know this already happens but initially
> the idea was that TypeInformation is only used in the pre-flight phase).
> Why did you decide to generate the code on the TMs?
>
> Cheers,
> Till
>
>
>
>
>
> On Tue, Feb 7, 2017 at 10:34 PM, Pattarawat Chormai <[hidden email]>
> wrote:
>
>> Hi all,
>>
>> We’re working on FLINK-5734 : Implement code generation for
>> NormalizedKeySorter. We have just finished a design document that describes
>> how we’re going to proceed this implementation.
>>
>> Because this implement will touch several important parts of Flink, so we
>> would appreciate if you could give us some feedback before starting
>> implementation.
>> Here is the link to the document : https://docs.google.com/document/d/
>> 1anGQhBn9qI0yqe7twVvrDIiym4U4gxalJkZzM4Ar4QM/edit?usp=sharing <
>> https://docs.google.com/document/d/1anGQhBn9qI0yqe7twVvrDIiym4U4g
>> xalJkZzM4Ar4QM/edit?usp=sharing>
>>
>> Also, we would like to confirm that NormalizedKeySorter is always
>> instantiated at TaskManager, but we could not find any solution to verify
>> that yet. If you know how to do it, we would be very pleased to hear that.
>>
>> Thank you in advance for your feedback.
>>
>> Best regards,
>> Serkan, Gábor and Pat
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Greg Hogan
In reply to this post by Pattarawat Chormai
Hi Pat, Serkan, and Gábor,

This looks very nice. I'll treat this like a pre-FLIP and ask my question
here.

Do I understand correctly that the generated code is only dependent on the
length of the sort key? So we could separate the writing and reading of
keys and records and from the generated code to compare and swap. Most jobs
will use similar key lengths so the generated code could be stored in an
LRU cache which would not impact startup performance.

From the document, "we have found that the DividedByConstant approach is
almost as good as UsingBitwiseOperators, and does not require
indexEntriesPerSegment to be a power of 2". How do you propose computing
indexEntriesPerSegment? Is the improved performance not dependent on this
value being a power-of-2?

Are you able to compare with FLINK-3722 [1]? A disadvantage of both
DividedByConstant and UsingBitwiseOperators when using a power-of-2 is that
up to half of the sort memory can be unused. FLINK-3722 paginates QuickSort
to use increment and decrement in place of divide and modulus.

[1] https://github.com/apache/flink/pull/2628/files

Greg

On Tue, Feb 7, 2017 at 4:34 PM, Pattarawat Chormai <[hidden email]>
wrote:

> Hi all,
>
> We’re working on FLINK-5734 : Implement code generation for
> NormalizedKeySorter. We have just finished a design document that describes
> how we’re going to proceed this implementation.
>
> Because this implement will touch several important parts of Flink, so we
> would appreciate if you could give us some feedback before starting
> implementation.
> Here is the link to the document : https://docs.google.com/document/d/
> 1anGQhBn9qI0yqe7twVvrDIiym4U4gxalJkZzM4Ar4QM/edit?usp=sharing <
> https://docs.google.com/document/d/1anGQhBn9qI0yqe7twVvrDIiym4U4g
> xalJkZzM4Ar4QM/edit?usp=sharing>
>
> Also, we would like to confirm that NormalizedKeySorter is always
> instantiated at TaskManager, but we could not find any solution to verify
> that yet. If you know how to do it, we would be very pleased to hear that.
>
> Thank you in advance for your feedback.
>
> Best regards,
> Serkan, Gábor and Pat
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Till Rohrmann
I'm not sure how the code generation and compilation works in detail, but
aren't you able to write a class file as the result of the compilation?
Then this class file could be uploaded to the JM and from there to all TMs.
The class basically becomes another job dependency like the user code jar
and will then be part of the user code class loader. Then the
deserialization should be possible, right? But I'm not sure whether you
have the user code class loader available at the places where you
deserialize the code.

Cheers,
Till

On Wed, Feb 8, 2017 at 10:54 PM, Greg Hogan <[hidden email]> wrote:

> Hi Pat, Serkan, and Gábor,
>
> This looks very nice. I'll treat this like a pre-FLIP and ask my question
> here.
>
> Do I understand correctly that the generated code is only dependent on the
> length of the sort key? So we could separate the writing and reading of
> keys and records and from the generated code to compare and swap. Most jobs
> will use similar key lengths so the generated code could be stored in an
> LRU cache which would not impact startup performance.
>
> From the document, "we have found that the DividedByConstant approach is
> almost as good as UsingBitwiseOperators, and does not require
> indexEntriesPerSegment to be a power of 2". How do you propose computing
> indexEntriesPerSegment? Is the improved performance not dependent on this
> value being a power-of-2?
>
> Are you able to compare with FLINK-3722 [1]? A disadvantage of both
> DividedByConstant and UsingBitwiseOperators when using a power-of-2 is that
> up to half of the sort memory can be unused. FLINK-3722 paginates QuickSort
> to use increment and decrement in place of divide and modulus.
>
> [1] https://github.com/apache/flink/pull/2628/files
>
> Greg
>
> On Tue, Feb 7, 2017 at 4:34 PM, Pattarawat Chormai <[hidden email]>
> wrote:
>
> > Hi all,
> >
> > We’re working on FLINK-5734 : Implement code generation for
> > NormalizedKeySorter. We have just finished a design document that
> describes
> > how we’re going to proceed this implementation.
> >
> > Because this implement will touch several important parts of Flink, so we
> > would appreciate if you could give us some feedback before starting
> > implementation.
> > Here is the link to the document : https://docs.google.com/document/d/
> > 1anGQhBn9qI0yqe7twVvrDIiym4U4gxalJkZzM4Ar4QM/edit?usp=sharing <
> > https://docs.google.com/document/d/1anGQhBn9qI0yqe7twVvrDIiym4U4g
> > xalJkZzM4Ar4QM/edit?usp=sharing>
> >
> > Also, we would like to confirm that NormalizedKeySorter is always
> > instantiated at TaskManager, but we could not find any solution to verify
> > that yet. If you know how to do it, we would be very pleased to hear
> that.
> >
> > Thank you in advance for your feedback.
> >
> > Best regards,
> > Serkan, Gábor and Pat
>
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Pattarawat Chormai
In reply to this post by Greg Hogan
Hi Greg,

Thanks for your feedback. I would like to answer your questions here.

Q: Do I understand correctly that the generated code is only dependent on the
length of the sort key?

A: Yes, you're right. The generated code is mainly dependent on the length of the sort key. However, I'm not sure whether we can reuse the generated code for 2 different types of key even they have the same length.

Q: How do you propose computing indexEntriesPerSegment? Is the improved performance not dependent on this value being a power-of-2?
 
A: This improvement doesn't change the way indexEntriesPerSegment is computed. What we did here is replacing indexEntriesPerSegment at expressions that indexEntriesPerSegment is a divisor with its value [1]. As a result, it does not require value of indexEntriesPerSegment to be an exponent of 2. More information related to this technique can be found at [2].

Q: Are you able to compare with FLINK-3722?

A: We're planning to do that as well, however, it might take some time due to exams.

Also, could you please elaborate more why DividedByConstant and UsingBitwiseOperators use only half of sort memory when indexEntriesPerSegment is power of 2?


[1] https://github.com/heytitle/flink-sorter-performance-evaluation/blob/master/src/main/java/org/evaluation/sorter/individual/optimization/DividedByConstant.java#L353
[2] https://blogs.msdn.microsoft.com/devdev/2005/12/12/integer-division-by-constants/
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Greg Hogan
On Thu, Feb 9, 2017 at 3:50 PM, pat.chormai <[hidden email]> wrote:

> Hi Greg,
>
> Thanks for your feedback. I would like to answer your questions here.
>
> Q: Do I understand correctly that the generated code is only dependent on
> the
> length of the sort key?
>
> A: Yes, you're right. The generated code is mainly dependent on the length
> of the sort key. However, I'm not sure whether we can reuse the generated
> code for 2 different types of key even they have the same length.
>

The only polymorphic callsite is NormalizedKeySorter's call to
TypeComparator.compareSerialized(DataInputView, DataInputView).


> Q: How do you propose computing indexEntriesPerSegment? Is the improved
> performance not dependent on this value being a power-of-2?
>
> A: This improvement doesn't change the way indexEntriesPerSegment is
> computed. What we did here is replacing indexEntriesPerSegment at
> expressions that indexEntriesPerSegment is a divisor with its value [1]. As
> a result, it does not require value of indexEntriesPerSegment to be an
> exponent of 2. More information related to this technique can be found at
> [2].
>

The example for dividing by 5 requires 2 multiples, 6 shifts, and 6
adds/subtracts. Then another multiple and subtract to get the remainder.
FLINK-3722 uses one or two adds/subtracts and a skewed branch prediction.


> Q: Are you able to compare with FLINK-3722?
>
> A: We're planning to do that as well, however, it might take some time due
> to exams.
>
> Also, could you please elaborate more why DividedByConstant and
> UsingBitwiseOperators use only half of sort memory when
> indexEntriesPerSegment is power of 2?
>

If the key size is one more than a power-of-2 then it looks like those
implementations are padding out to the next higher power-of-2 (e.g., key
size of 9 bytes needs 18 bytes but will consume 32 bytes).

[1]

> https://github.com/heytitle/flink-sorter-performance-
> evaluation/blob/master/src/main/java/org/evaluation/sorter/individual/
> optimization/DividedByConstant.java#L353
> [2]
> https://blogs.msdn.microsoft.com/devdev/2005/12/12/integer-
> division-by-constants/
>
>
>
> --
> View this message in context: http://apache-flink-mailing-
> list-archive.1008284.n3.nabble.com/FLINK-5734-Code-Generation-for-
> NormalizedKeySorter-tp15804p15860.html
> Sent from the Apache Flink Mailing List archive. mailing list archive at
> Nabble.com.
>
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Pattarawat Chormai
Hi [~greghogan]

I have done the benchmark comparing between FLINK-3722 and our approaches. As you can see at Score  column which represents sorting time, FLINK-3722 approach seems to be the fastest one.

(noRecords)                                                                      (sorterClass)    Score  Score error  Units
      10000                        org.apache.flink.runtime.operators.sort.NormalizedKeySorter    4.340        0.055     ms
      10000  org.evaluation.sorter.individual.optimization.FindSegmentIndexViaBitwiseOperators    3.333        0.053     ms
      10000                    org.evaluation.sorter.individual.optimization.DividedByConstant    3.493        0.060     ms
      10000                                                  org.flink3722.NormalizedKeySorter    3.001        0.061     ms
     100000                        org.apache.flink.runtime.operators.sort.NormalizedKeySorter   61.786        0.602     ms
     100000  org.evaluation.sorter.individual.optimization.FindSegmentIndexViaBitwiseOperators   42.217        0.529     ms
     100000                    org.evaluation.sorter.individual.optimization.DividedByConstant   45.141        0.460     ms
     100000                                                  org.flink3722.NormalizedKeySorter   35.903        0.586     ms
    1000000                        org.apache.flink.runtime.operators.sort.NormalizedKeySorter  582.881        0.547     ms
    1000000  org.evaluation.sorter.individual.optimization.FindSegmentIndexViaBitwiseOperators  419.972        1.057     ms
    1000000                    org.evaluation.sorter.individual.optimization.DividedByConstant  436.222        1.102     ms
    1000000                                                  org.flink3722.NormalizedKeySorter  386.953        0.725     ms

Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Gábor Gévay
Hello,

Pat, the table in your email is somehow not visible in my gmail, but
it is visible here:
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/FLINK-5734-Code-Generation-for-NormalizedKeySorter-tt15804.html#a15936
Maybe the problem is caused by the formatting.

> FLINK-3722
> approach seems to be the fastest one.

OK, then I would suggest to do the implementation of the code
generation on top of the PR for FLINK-3722. I guess we can assume that
that PR will be merged sooner than the code generation, so there won't
be any serious merge conflicts this way.

Best,
Gábor




2017-02-14 11:16 GMT+01:00 pat.chormai <[hidden email]>:

> Hi [~greghogan]
>
> I have done the benchmark comparing between FLINK-3722 and our approaches.
> As you can see at *Score * column which represents sorting time, FLINK-3722
> approach seems to be the fastest one.
>
>
>
>
>
> --
> View this message in context: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/FLINK-5734-Code-Generation-for-NormalizedKeySorter-tp15804p15936.html
> Sent from the Apache Flink Mailing List archive. mailing list archive at Nabble.com.
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Greg Hogan
Pat,

Thanks for adding the new test results. This idea for this implementation
was Gábor's from the FLINK-3722 description.

Since you will be filing a FLIP I recommend including these benchmarks for
consideration and discussion on the mailing list. In part because the PR is
4 months old and need of further review but also that there may be new
ideas or questions as to the form of the new code.

Greg

On Tue, Feb 14, 2017 at 5:40 AM, Gábor Gévay <[hidden email]> wrote:

> Hello,
>
> Pat, the table in your email is somehow not visible in my gmail, but
> it is visible here:
> http://apache-flink-mailing-list-archive.1008284.n3.
> nabble.com/FLINK-5734-Code-Generation-for-NormalizedKeySorter-tt15804.
> html#a15936
> Maybe the problem is caused by the formatting.
>
> > FLINK-3722
> > approach seems to be the fastest one.
>
> OK, then I would suggest to do the implementation of the code
> generation on top of the PR for FLINK-3722. I guess we can assume that
> that PR will be merged sooner than the code generation, so there won't
> be any serious merge conflicts this way.
>
> Best,
> Gábor
>
>
>
>
> 2017-02-14 11:16 GMT+01:00 pat.chormai <[hidden email]>:
> > Hi [~greghogan]
> >
> > I have done the benchmark comparing between FLINK-3722 and our
> approaches.
> > As you can see at *Score * column which represents sorting time,
> FLINK-3722
> > approach seems to be the fastest one.
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-flink-mailing-
> list-archive.1008284.n3.nabble.com/FLINK-5734-Code-Generation-for-
> NormalizedKeySorter-tp15804p15936.html
> > Sent from the Apache Flink Mailing List archive. mailing list archive at
> Nabble.com.
>
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Pattarawat Chormai
In reply to this post by Pattarawat Chormai
Hi all,

We have almost finished implementing the functionalities. The implementation is available at [1].

Also, we have included the benchmark result of FLINK-3722 into the FLIP[2] as well as other implementation details

We would be appreciated if you could give us feedback on this before actually submitting PR to Flink repo.

Best,
Pat

[1] https://github.com/heytitle/flink/pull/1
[2] https://docs.google.com/document/d/1anGQhBn9qI0yqe7twVvrDIiym4U4gxalJkZzM4Ar4QM
Reply | Threaded
Open this post in threaded view
|

Re: FLINK-5734 : Code Generation for NormalizedKeySorter

Greg Hogan
Hi Pat,

I’m still trying to understand the implications of Java’s Class Hierarchy Analysis [0].

Flink currently uses only a single implementation of InMemorySorter, which is NormalizedKeySorter. FLINK-4705 adds support for FixedLengthRecordSorter for Flink’s Value types and Tuples.

This proposal compiles an additional implementation of InMemorySorter for each key length. The calls from QuickSort and HeapSort will now be polymorphic during sorting, as will the calls from the drivers and UnilateralSortMerger.

Greg

[0]: http://flink.apache.org/news/2015/09/16/off-heap-memory.html <http://flink.apache.org/news/2015/09/16/off-heap-memory.html>
[1]: https://issues.apache.org/jira/browse/FLINK-4705 <https://issues.apache.org/jira/browse/FLINK-4705>


> On Mar 7, 2017, at 10:06 AM, pat.chormai <[hidden email]> wrote:
>
> Hi all,
>
> We have almost finished implementing the functionalities. The implementation
> is available at [1].
>
> Also, we have included the benchmark result of FLINK-3722 into the FLIP[2]
> as well as other implementation details
>
> We would be appreciated if you could give us feedback on this before
> actually submitting PR to Flink repo.
>
> Best,
> Pat
>
> [1] https://github.com/heytitle/flink/pull/1
> [2]
> https://docs.google.com/document/d/1anGQhBn9qI0yqe7twVvrDIiym4U4gxalJkZzM4Ar4QM
>
>
>
> --
> View this message in context: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/FLINK-5734-Code-Generation-for-NormalizedKeySorter-tp15804p16377.html
> Sent from the Apache Flink Mailing List archive. mailing list archive at Nabble.com.