[DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

[DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Xintong Song
Hi everyone,

We would like to start a discussion thread on "FLIP-49: Unified Memory
Configuration for TaskExecutors"[1], where we describe how to improve
TaskExecutor memory configurations. The FLIP document is mostly based on an
early design "Memory Management and Configuration Reloaded"[2] by Stephan,
with updates from follow-up discussions both online and offline.

This FLIP addresses several shortcomings of the current (Flink 1.9)
TaskExecutor memory configuration.

   - Different configuration for Streaming and Batch.
   - Complex and difficult configuration of RocksDB in Streaming.
   - Complicated, non-deterministic, and hard-to-understand calculation logic.


Key changes to solve these problems can be summarized as follows.

   - Extend the memory manager to also account for memory usage by state
   backends.
   - Modify how TaskExecutor memory is partitioned, accounting for individual
   memory reservations and pools.
   - Simplify the memory configuration options and calculation logic.


Please find more details in the FLIP wiki document [1].

(Please note that the early design doc [2] is out of sync; we would
appreciate having the discussion in this mailing list thread.)


Looking forward to your feedback.


Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors

[2]
https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Yang Wang
Hi Xintong,


Thanks for your detailed proposal. Once all the memory configuration options
are introduced, it will be much easier to control Flink's memory usage. I
just have a few questions about it.



   - Native and Direct Memory

We do not differentiate user direct memory and native memory; they are all
included in task off-heap memory, right? In that case I don't think we can
set -XX:MaxDirectMemorySize properly, so I would prefer leaving it at a very
large value.



   - Memory Calculation

If the sum of the fine-grained memory pools (network memory, managed memory,
etc.) is larger than the total process memory, how do we deal with this
situation? Do we need to check the memory configuration on the client?
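The client-side check being asked about could be sketched roughly as follows. This is a hypothetical illustration, not actual Flink code; the pool names and method are made up for the example.

```java
// Hypothetical sketch of a client-side sanity check: if the explicitly
// configured fine-grained pools add up to more than the total process
// memory, fail fast before the cluster is ever deployed.
public class MemoryConfigCheck {

    static void validate(long totalProcessMb, long networkMb, long managedMb,
                         long heapMb, long offHeapMb) {
        long sum = networkMb + managedMb + heapMb + offHeapMb;
        if (sum > totalProcessMb) {
            throw new IllegalArgumentException(
                "Configured memory pools (" + sum + " MB) exceed total process memory ("
                    + totalProcessMb + " MB).");
        }
    }

    public static void main(String[] args) {
        validate(4096, 512, 1024, 2048, 256); // fits: 3840 MB <= 4096 MB
        try {
            validate(2048, 512, 1024, 2048, 256); // 3840 MB > 2048 MB
        } catch (IllegalArgumentException e) {
            System.out.println("conflict detected");
        }
    }
}
```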


Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Xintong Song
Thanks for the feedback, Yang.

Regarding your comments:

*Native and Direct Memory*
I think setting a very large max direct memory size definitely has some
upsides. E.g., we do not need to worry about direct OOMs, and we don't even
need to allocate managed / network memory with Unsafe.allocate().
However, there are also some downsides to doing this.

   - One thing I can think of is that if a task executor container is
   killed due to overusing memory, it could be hard for us to know which part
   of the memory is overused.
   - Another downside is that the JVM never triggers GC due to reaching the
   max direct memory limit, because the limit is too high to be reached. That
   means we kind of rely on heap activity to trigger GC and release direct
   memory. That could be a problem in cases where we have a lot of direct
   memory usage but not enough heap activity to trigger the GC.

Maybe you can share your reasons for preferring a very large value, in case
there is anything else I overlooked.
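The second downside above can be illustrated with a minimal sketch: a direct buffer's payload lives outside the heap, while the heap only holds a tiny wrapper object, so dropping the reference builds up almost no GC pressure. (The numbers below are arbitrary; this is just a demonstration of the mechanism, not Flink code.)

```java
import java.nio.ByteBuffer;

public class DirectMemoryDemo {
    // Allocates n direct buffers of 1 MB each and drops the references.
    // Returns the number of bytes requested off-heap.
    static long allocateAndDrop(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            // The 1 MB payload lives outside the heap; the heap only holds a
            // small ByteBuffer wrapper, so little GC pressure builds up.
            ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);
            buf.put(0, (byte) 1);
            total += buf.capacity();
        }
        // The wrappers are now unreachable, but the native memory is only
        // returned once a GC actually collects them; with a very high
        // -XX:MaxDirectMemorySize, nothing forces that GC to happen.
        return total;
    }

    public static void main(String[] args) {
        System.out.println(allocateAndDrop(8)); // 8 MB requested off-heap
    }
}
```

With a tight -XX:MaxDirectMemorySize, hitting the limit triggers a GC (and thus reclamation) before an OOM; with a very large limit, reclamation depends entirely on unrelated heap activity.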

*Memory Calculation*
If there is any conflict between multiple configuration options that the
user explicitly specified, I think we should throw an error.
I think checking on the client side is a good idea, so that on Yarn / K8s we
can discover the problem before deploying the Flink cluster, which is always
a good thing.
But we cannot rely only on the client-side checking, because for a
standalone cluster, TaskManagers on different machines may have different
configurations, and the client does not see them.
What do you think?

Thank you~

Xintong Song




Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Till Rohrmann
Thanks for proposing this FLIP Xintong.

All in all I think it already looks quite good. Concerning the first open
question about allocating memory segments, I was wondering whether this is
strictly necessary to do in the context of this FLIP or whether this could
be done as a follow up? Without knowing all details, I would be concerned
that we would widen the scope of this FLIP too much because we would have
to touch all the existing call sites of the MemoryManager where we allocate
memory segments (this should mainly be batch operators). The addition of
the memory reservation call to the MemoryManager should not be affected by
this and I would hope that this is the only point of interaction a
streaming job would have with the MemoryManager.

Concerning the second open question about setting or not setting a max
direct memory limit, I would also be interested in why Yang Wang thinks
leaving it open would be best. My concern is that we would be in a similar
situation as we are now with the RocksDBStateBackend. If the different
memory pools are not clearly separated and can spill over to a different
pool, then it is quite hard to understand what exactly causes a process to
get killed for using too much memory. This could then easily lead to a
situation similar to what we have with the cutoff ratio. So why not set a
sane default value for max direct memory and give the user an option to
increase it if they run into an OOM?
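The "sane default plus user override" idea could look roughly like the sketch below. The option shapes and names are hypothetical, not actual Flink configuration keys; it just shows the precedence being suggested.

```java
// Hypothetical sketch: derive -XX:MaxDirectMemorySize from the configured
// pools by default, but let an explicit user override win.
public class JvmArgsBuilder {
    static String maxDirectMemoryArg(long taskOffHeapMb, long jvmOverheadMb,
                                     Long userOverrideMb) {
        long limit = (userOverrideMb != null)
            ? userOverrideMb                  // user raised it after an OOM
            : taskOffHeapMb + jvmOverheadMb;  // sane default from the pools
        return "-XX:MaxDirectMemorySize=" + limit + "m";
    }

    public static void main(String[] args) {
        System.out.println(maxDirectMemoryArg(256, 512, null));  // default
        System.out.println(maxDirectMemoryArg(256, 512, 2048L)); // override
    }
}
```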

@Xintong, how would alternative 2 lead to lower memory utilization than
alternative 3 where we set the direct memory to a higher value?

Cheers,
Till


Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Xintong Song
Thanks for replying, Till.

About MemorySegment, I think you are right that we should not include this
issue in the scope of this FLIP. This FLIP should concentrate on how to
configure memory pools for TaskExecutors, with minimum involvement on how
memory consumers use it.

About direct memory, I think alternative 3 may not have the same
over-reservation issue that alternative 2 does, but at the cost of a risk of
overusing memory at the container level, which is not good. My point is that
both "Task Off-Heap Memory" and "JVM Overhead" are not easy to configure.
For alternative 2, users might configure them higher than what is actually
needed, just to avoid getting a direct OOM. For alternative 3, users do not
get direct OOMs, so they may not configure the two options aggressively
high. But the consequence is a risk that overall container memory usage
exceeds the budget.

Thank you~

Xintong Song




Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Yang Wang
Hi Xintong, Till,


> Native and Direct Memory

My point was about setting a very large max direct memory size when we do
not differentiate direct and native memory. If the direct memory, including
user direct memory and framework direct memory, could be calculated
correctly, then I am in favor of setting the direct memory to a fixed value.



> Memory Calculation

I agree with Xintong. For Yarn and K8s, we need to check the memory
configurations on the client to avoid submitting successfully and then
failing in the Flink master.


Best,

Yang


Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Till Rohrmann
I guess you have to help me understand the difference between alternatives 2
and 3 w.r.t. memory underutilization, Xintong.

- Alternative 2: set -XX:MaxDirectMemorySize to Task Off-Heap Memory plus
JVM Overhead. Then there is the risk that this size is too low, resulting in
a lot of garbage collection and potentially an OOM.
- Alternative 3: set -XX:MaxDirectMemorySize to something larger than in
alternative 2. This would of course reduce the sizes of the other memory
types.

How would alternative 2 now result in an underutilization of memory compared
to alternative 3? If alternative 3 strictly sets a higher max direct memory
size and we use only a little of it, then I would expect that alternative 3
results in memory underutilization.
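The arithmetic behind the two alternatives can be sketched as below. The pool sizes are made up and the names are illustrative, not actual Flink options; the point is only that alternative 3 grants a strictly larger direct-memory ceiling, and any unused part of that ceiling is reserved-but-idle container memory.

```java
// Illustrative comparison of the two alternatives discussed above.
public class AlternativeComparison {
    // Alternative 2: the ceiling is exactly the off-heap budget.
    static long alt2Limit(long taskOffHeapMb, long jvmOverheadMb) {
        return taskOffHeapMb + jvmOverheadMb;
    }

    // Alternative 3: the ceiling gets extra headroom on top of that budget.
    static long alt3Limit(long taskOffHeapMb, long jvmOverheadMb, long headroomMb) {
        return taskOffHeapMb + jvmOverheadMb + headroomMb;
    }

    public static void main(String[] args) {
        long alt2 = alt2Limit(256, 512);        // 768 MB ceiling
        long alt3 = alt3Limit(256, 512, 1024);  // 1792 MB ceiling
        // If actual direct usage stays low, the extra headroom of
        // alternative 3 is memory the container reserves but never uses.
        System.out.println(alt3 - alt2);        // 1024 MB of potential slack
    }
}
```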

Cheers,
Till

On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <[hidden email]> wrote:

> Hi xintong,till
>
>
> > Native and Direct Memory
>
> My point is setting a very large max direct memory size when we do not
> differentiate direct and native memory. If the direct memory,including user
> direct memory and framework direct memory,could be calculated
> correctly,then
> i am in favor of setting direct memory with fixed value.
>
>
>
> > Memory Calculation
>
> I agree with xintong. For Yarn and k8s,we need to check the memory
> configurations in client to avoid submitting successfully and failing in
> the flink master.
>
>
> Best,
>
> Yang
>
> Xintong Song <[hidden email]>于2019年8月13日 周二22:07写道:
>
> > Thanks for replying, Till.
> >
> > About MemorySegment, I think you are right that we should not include
> > this issue in the scope of this FLIP. This FLIP should concentrate on how
> > to configure memory pools for TaskExecutors, with minimal involvement in
> > how memory consumers use them.
> >
> > About direct memory, I think alternative 3 may not have the same
> > over-reservation issue that alternative 2 does, but at the cost of
> > risking memory overuse at the container level, which is not good. My
> > point is that both "Task Off-Heap Memory" and "JVM Overhead" are not easy
> > to configure. For alternative 2, users might configure them higher than
> > what is actually needed, just to avoid getting a direct OOM. For
> > alternative 3, users do not get direct OOMs, so they may not configure
> > the two options aggressively high, but the consequence is a risk that the
> > overall container memory usage exceeds the budget.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <[hidden email]> wrote:
> >
> > > Thanks for proposing this FLIP Xintong.
> > >
> > > All in all I think it already looks quite good. Concerning the first
> > > open question about allocating memory segments, I was wondering whether
> > > this is strictly necessary to do in the context of this FLIP or whether
> > > this could be done as a follow-up? Without knowing all details, I would
> > > be concerned that we would widen the scope of this FLIP too much,
> > > because we would have to touch all the existing call sites of the
> > > MemoryManager where we allocate memory segments (this should mainly be
> > > batch operators). The addition of the memory reservation call to the
> > > MemoryManager should not be affected by this, and I would hope that
> > > this is the only point of interaction a streaming job would have with
> > > the MemoryManager.
> > >
> > > Concerning the second open question about setting or not setting a max
> > > direct memory limit, I would also be interested why Yang Wang thinks
> > > leaving it open would be best. My concern is that we would be in a
> > > similar situation as we are now with the RocksDBStateBackend. If the
> > > different memory pools are not clearly separated and can spill over to
> > > a different pool, then it is quite hard to understand what exactly
> > > causes a process to get killed for using too much memory. This could
> > > easily lead to a situation similar to what we have with the
> > > cutoff-ratio. So why not set a sane default value for max direct memory
> > > and give the user an option to increase it if they run into an OOM?
> > >
> > > @Xintong, how would alternative 2 lead to lower memory utilization than
> > > alternative 3, where we set the direct memory to a higher value?
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <[hidden email]> wrote:
> > >
> > > > Thanks for the feedback, Yang.
> > > >
> > > > Regarding your comments:
> > > >
> > > > *Native and Direct Memory*
> > > > I think setting a very large max direct memory size definitely has
> > > > some good sides. E.g., we do not need to worry about direct OOMs, and
> > > > we don't even need to allocate managed / network memory with
> > > > Unsafe.allocate().
> > > > However, there are also some downsides of doing this.
> > > >
> > > >    - One thing I can think of is that if a task executor container is
> > > >    killed due to overusing memory, it could be hard for us to know
> > > >    which part of the memory is overused.
> > > >    - Another downside is that the JVM never triggers GC due to
> > > >    reaching the max direct memory limit, because the limit is too
> > > >    high to be reached. That means we kind of rely on heap activity to
> > > >    trigger GC and release direct memory. That could be a problem in
> > > >    cases where we have more direct memory usage but not enough heap
> > > >    activity to trigger the GC.
> > > >
> > > > Maybe you can share your reasons for preferring setting a very large
> > > > value, if there is anything else I overlooked.
> > > >
> > > > *Memory Calculation*
> > > > If there is any conflict between multiple configurations that the
> > > > user explicitly specified, I think we should throw an error.
> > > > I think doing the checking on the client side is a good idea, so that
> > > > on Yarn / K8s we can discover the problem before submitting the Flink
> > > > cluster, which is always a good thing.
> > > > But we cannot rely only on client-side checking, because for a
> > > > standalone cluster, TaskManagers on different machines may have
> > > > different configurations and the client does not see them.
> > > > What do you think?
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <[hidden email]> wrote:
> > > >
> > > > > Hi Xintong,
> > > > >
> > > > > Thanks for your detailed proposal. After all the memory
> > > > > configuration options are introduced, it will be more powerful to
> > > > > control the Flink memory usage. I just have a few questions about
> > > > > it.
> > > > >
> > > > >    - Native and Direct Memory
> > > > >
> > > > > We do not differentiate user direct memory and native memory. They
> > > > > are all included in task off-heap memory. Right? So I don't think
> > > > > we can set -XX:MaxDirectMemorySize properly. I prefer leaving it a
> > > > > very large value.
> > > > >
> > > > >    - Memory Calculation
> > > > >
> > > > > If the sum of the fine-grained memory pools (network memory,
> > > > > managed memory, etc.) is larger than the total process memory, how
> > > > > do we deal with this situation? Do we need to check the memory
> > > > > configuration in the client?
> > > > >

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Xintong Song
Let me explain this with a concrete example Till.

Let's say we have the following scenario.

Total Process Memory: 1GB
JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory and
Network Memory): 800MB


For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
For alternative 3, we set -XX:MaxDirectMemorySize to a very large value,
let's say 1TB.

If the actual direct memory usage of Task Off-Heap Memory and JVM Overhead
does not exceed 200MB, then alternative 2 and alternative 3 should have the
same utility. Setting a larger -XX:MaxDirectMemorySize will not reduce the
sizes of the other memory pools.

If the actual direct memory usage of Task Off-Heap Memory and JVM Overhead
can potentially exceed 200MB, then

   - Alternative 2 suffers from frequent OOMs. To avoid that, the only thing
   the user can do is modify the configuration and increase JVM Direct Memory
   (Task Off-Heap Memory + JVM Overhead). Say the user increases JVM Direct
   Memory to 250MB; this reduces the total size of the other memory pools to
   750MB, given that the total process memory remains 1GB.
   - For alternative 3, there is no chance of a direct OOM. There are chances
   of exceeding the total process memory limit, but given that the process may
   not use up all the reserved native memory (Off-Heap Managed Memory, Network
   Memory, JVM Metaspace), if the actual direct memory usage is slightly above
   yet very close to 200MB, the user probably does not need to change the
   configurations.

Therefore, I think from the user's perspective, a feasible configuration
for alternative 2 may lead to lower resource utilization compared to
alternative 3.
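
The arithmetic behind this comparison can be sketched as follows; the numbers mirror the example above, and the class is purely illustrative, not Flink code:

```java
// Illustrative arithmetic for the ~1GB example above; not Flink code.
public class BudgetExample {

    // Memory left for heap, metaspace, managed and network memory once the
    // direct memory budget is carved out of the total process memory.
    static long otherPoolsMb(long totalProcessMb, long directBudgetMb) {
        return totalProcessMb - directBudgetMb;
    }

    public static void main(String[] args) {
        long total = 1000; // ~1GB total process memory, as in the example
        // Alternative 2, original budget: 200MB direct leaves 800MB for the rest.
        System.out.println(otherPoolsMb(total, 200));
        // Alternative 2 after raising direct memory to 250MB to avoid OOMs:
        // the other pools shrink to 750MB.
        System.out.println(otherPoolsMb(total, 250));
        // Alternative 3 leaves the configured 800MB untouched; the risk moves
        // to the container-level limit instead.
    }
}
```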

Thank you~

Xintong Song



Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Till Rohrmann
Thanks for the clarification Xintong. I understand the two alternatives now.

I would be in favour of option 2 because it makes things explicit. If we
don't limit the direct memory, I fear that we might end up in a similar
situation as we are currently in: The user might see that her process gets
killed by the OS and does not know why this is the case. Consequently, she
tries to decrease the process memory size (similar to increasing the cutoff
ratio) in order to accommodate the extra direct memory. Even worse, she
tries to decrease memory budgets which are not fully used and hence won't
change the overall memory consumption.
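
One way to read this preference: derive every JVM bound explicitly from the configured pools, so an overrun trips a named JVM limit instead of an opaque OS or container kill. A hedged sketch — the class and chosen sizes are hypothetical, not actual Flink configuration:

```java
// Hypothetical sketch: build explicit JVM bounds from configured pool sizes
// so any overrun fails on a named JVM limit, not a silent container kill.
public class ExplicitLimits {

    static String jvmArgs(long heapMb, long directMb, long metaspaceMb) {
        return "-Xmx" + heapMb + "m"
                + " -XX:MaxDirectMemorySize=" + directMb + "m"
                + " -XX:MaxMetaspaceSize=" + metaspaceMb + "m";
    }

    public static void main(String[] args) {
        // With explicit caps, a direct-memory overrun surfaces as a direct
        // OOM pointing at -XX:MaxDirectMemorySize, which the user can raise.
        System.out.println(jvmArgs(600, 200, 96));
    }
}
```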

Cheers,
Till


Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Xintong Song
Thanks for sharing your opinion Till.

I'm also in favor of alternative 2. I was wondering whether we can avoid
using Unsafe.allocate() for off-heap managed memory and network memory with
alternative 3. But after giving it a second thought, I think even for
alternative 3 using direct memory for off-heap managed memory could cause
problems.

Hi Yang,

Regarding your concern, I think what is proposed in this FLIP is to have
both off-heap managed memory and network memory allocated through
Unsafe.allocate(), which means they are practically native memory and not
limited by the JVM max direct memory. The only parts of memory limited by
the JVM max direct memory are task off-heap memory and JVM overhead, which
is exactly what alternative 2 suggests setting the JVM max direct memory to.

Thank you~

Xintong Song
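To illustrate the distinction above, here is a minimal Java sketch (not Flink code; the class name and sizes are made up): memory obtained through Unsafe.allocateMemory() is plain native memory that the JVM does not count against -XX:MaxDirectMemorySize, while ByteBuffer.allocateDirect() is accounted against that limit and released by the GC.

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;

// Sketch only: contrasts the two allocation paths discussed in this thread.
public class AllocationSketch {

    // Allocates `size` bytes both ways, frees the native part explicitly,
    // and returns the direct buffer's capacity.
    static long allocateBothWays(int size) throws Exception {
        // Counted against -XX:MaxDirectMemorySize; released by the GC.
        ByteBuffer direct = ByteBuffer.allocateDirect(size);

        // Native memory via Unsafe: invisible to the direct memory limit and
        // must be freed explicitly -- freeing it while still referenced is
        // how a segment access can segfault the JVM.
        Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        sun.misc.Unsafe unsafe = (sun.misc.Unsafe) f.get(null);
        long address = unsafe.allocateMemory(size);
        unsafe.setMemory(address, size, (byte) 0);
        unsafe.freeMemory(address);

        return direct.capacity();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("direct capacity: " + allocateBothWays(1 << 20));
    }
}
```

With the Unsafe path, the direct memory limit only needs to cover the truly direct allocations (task off-heap memory and JVM overhead), which is the essence of alternative 2.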



On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]> wrote:

> Thanks for the clarification Xintong. I understand the two alternatives
> now.
>
> I would be in favour of option 2 because it makes things explicit. If we
> don't limit the direct memory, I fear that we might end up in a similar
> situation as we are currently in: The user might see that her process gets
> killed by the OS and does not know why this is the case. Consequently, she
> tries to decrease the process memory size (similar to increasing the cutoff
> ratio) in order to accommodate for the extra direct memory. Even worse, she
> tries to decrease memory budgets which are not fully used and hence won't
> change the overall memory consumption.
>
> Cheers,
> Till
>
> On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <[hidden email]>
> wrote:
>
> > Let me explain this with a concrete example Till.
> >
> > Let's say we have the following scenario.
> >
> > Total Process Memory: 1GB
> > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory and
> > Network Memory): 800MB
> >
> >
> > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > For alternative 3, we set -XX:MaxDirectMemorySize to a very large value,
> > let's say 1TB.
> >
> > If the actual direct memory usage of Task Off-Heap Memory and JVM
> Overhead
> > do not exceed 200MB, then alternative 2 and alternative 3 should have the
> > same utility. Setting larger -XX:MaxDirectMemorySize will not reduce the
> > sizes of the other memory pools.
> >
> > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > Overhead potentially exceed 200MB, then
> >
> >    - Alternative 2 suffers from frequent OOM. To avoid that, the only
> thing
> >    user can do is to modify the configuration and increase JVM Direct
> > Memory
> >    (Task Off-Heap Memory + JVM Overhead). Let's say that user increases
> JVM
> >    Direct Memory to 250MB, this will reduce the total size of other
> memory
> >    pools to 750MB, given the total process memory remains 1GB.
> >    - For alternative 3, there is no chance of direct OOM. There are
> chances
> >    of exceeding the total process memory limit, but given that the
> process
> > may
> >    not use up all the reserved native memory (Off-Heap Managed Memory,
> > Network
> >    Memory, JVM Metaspace), if the actual direct memory usage is slightly
> > above
> >    yet very close to 200MB, user probably do not need to change the
> >    configurations.
> >
> > Therefore, I think from the user's perspective, a feasible configuration
> > for alternative 2 may lead to lower resource utilization compared to
> > alternative 3.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > I guess you have to help me understand the difference between
> > alternative 2
> > > and 3 wrt to memory under utilization Xintong.
> > >
> > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap Memory and
> > JVM
> > > Overhead. Then there is the risk that this size is too low resulting
> in a
> > > lot of garbage collection and potentially an OOM.
> > > - Alternative 3: set XX:MaxDirectMemorySize to something larger than
> > > alternative 2. This would of course reduce the sizes of the other
> memory
> > > types.
> > >
> > > How would alternative 2 now result in an under utilization of memory
> > > compared to alternative 3? If alternative 3 strictly sets a higher max
> > > direct memory size and we use only little, then I would expect that
> > > alternative 3 results in memory under utilization.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <[hidden email]>
> wrote:
> > >
> > > > > Hi Xintong, Till,
> > > >
> > > >
> > > > > Native and Direct Memory
> > > >
> > > > My point is to set a very large max direct memory size when we do not
> > > > differentiate direct and native memory. If the direct memory, including
> > > > user direct memory and framework direct memory, could be calculated
> > > > correctly, then I am in favor of setting direct memory to a fixed value.
> > > >
> > > >
> > > >
> > > > > Memory Calculation
> > > >
> > > > I agree with Xintong. For Yarn and K8s, we need to check the memory
> > > > configurations on the client side to avoid submitting successfully and
> > > > then failing in the Flink master.
> > > >
> > > >
> > > > Best,
> > > >
> > > > Yang
> > > >
> > > > > Xintong Song <[hidden email]> wrote on Tue, Aug 13, 2019 at 22:07:
> > > >
> > > > > Thanks for replying, Till.
> > > > >
> > > > > About MemorySegment, I think you are right that we should not
> include
> > > > this
> > > > > issue in the scope of this FLIP. This FLIP should concentrate on
> how
> > to
> > > > > configure memory pools for TaskExecutors, with minimum involvement
> on
> > > how
> > > > > memory consumers use it.
> > > > >
> > > > > About direct memory, I think alternative 3 may not have the same
> > > > > over-reservation issue that alternative 2 does, but at the cost of a
> > > > > risk of over-using memory at the container level, which is not good.
> > > > > My point is
> > > that
> > > > > both "Task Off-Heap Memory" and "JVM Overhead" are not easy to
> > > > > configure. For alternative 2, users might configure them higher than
> > > > > what is actually needed, just to avoid getting a direct OOM. For
> > > > > alternative 3, users do not get a direct OOM, so they may not
> > > > > configure the two options aggressively high. But the consequence is a
> > > > > risk that the overall container memory usage exceeds the budget.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Thanks for proposing this FLIP Xintong.
> > > > > >
> > > > > > All in all I think it already looks quite good. Concerning the
> > first
> > > > open
> > > > > > question about allocating memory segments, I was wondering
> whether
> > > this
> > > > > is
> > > > > > strictly necessary to do in the context of this FLIP or whether
> > this
> > > > > could
> > > > > > be done as a follow up? Without knowing all details, I would be
> > > > concerned
> > > > > > that we would widen the scope of this FLIP too much because we
> > would
> > > > have
> > > > > > to touch all the existing call sites of the MemoryManager where
> we
> > > > > allocate
> > > > > > memory segments (this should mainly be batch operators). The
> > addition
> > > > of
> > > > > > the memory reservation call to the MemoryManager should not be
> > > affected
> > > > > by
> > > > > > this and I would hope that this is the only point of interaction
> a
> > > > > > streaming job would have with the MemoryManager.
> > > > > >
> > > > > > Concerning the second open question about setting or not setting
> a
> > > max
> > > > > > direct memory limit, I would also be interested why Yang Wang
> > thinks
> > > > > > leaving it open would be best. My concern about this would be
> that
> > we
> > > > > would
> > > > > > be in a similar situation as we are now with the
> > RocksDBStateBackend.
> > > > If
> > > > > > the different memory pools are not clearly separated and can
> spill
> > > over
> > > > > to
> > > > > > a different pool, then it is quite hard to understand what
> exactly
> > > > > causes a
> > > > > > process to get killed for using too much memory. This could then
> > > easily
> > > > > > lead to a similar situation what we have with the cutoff-ratio.
> So
> > > why
> > > > > not
> > > > > > setting a sane default value for max direct memory and giving the
> > > user
> > > > an
> > > > > > option to increase it if he runs into an OOM.
> > > > > >
> > > > > > @Xintong, how would alternative 2 lead to lower memory
> utilization
> > > than
> > > > > > alternative 3 where we set the direct memory to a higher value?
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> [hidden email]
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Thanks for the feedback, Yang.
> > > > > > >
> > > > > > > Regarding your comments:
> > > > > > >
> > > > > > > *Native and Direct Memory*
> > > > > > > I think setting a very large max direct memory size definitely
> > has
> > > > some
> > > > > > > good sides. E.g., we do not worry about direct OOM, and we
> don't
> > > even
> > > > > > need
> > > > > > > to allocate managed / network memory with Unsafe.allocate() .
> > > > > > > However, there are also some down sides of doing this.
> > > > > > >
> > > > > > >    - One thing I can think of is that if a task executor
> > > > > > >    container is killed due to overusing memory, it could be hard
> > > > > > >    for us to know which part of the memory is overused.
> > > > > > >    - Another downside is that the JVM never triggers GC due to
> > > > > > >    reaching the max direct memory limit, because the limit is too
> > > > > > >    high to be reached. That means we kind of rely on heap memory
> > > > > > >    to trigger GC and release direct memory. That could be a
> > > > > > >    problem in cases where we have more direct memory usage but
> > > > > > >    not enough heap activity to trigger the GC.
> > > > > > >
> > > > > > > Maybe you can share your reasons for preferring setting a very
> > > large
> > > > > > value,
> > > > > > > if there are anything else I overlooked.
> > > > > > >
> > > > > > > *Memory Calculation*
> > > > > > > If there is any conflict between multiple configuration that
> user
> > > > > > > explicitly specified, I think we should throw an error.
> > > > > > > I think doing checking on the client side is a good idea, so
> that
> > > on
> > > > > > Yarn /
> > > > > > > K8s we can discover the problem before submitting the Flink
> > > cluster,
> > > > > > which
> > > > > > > is always a good thing.
> > > > > > > But we cannot rely only on the client side checking, because for
> > > > > > > standalone clusters TaskManagers on different machines may have
> > > > > > > different configurations that the client does not see.
> > > > > > > What do you think?
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> [hidden email]>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi xintong,
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks for your detailed proposal. After all the memory
> > > > configuration
> > > > > > are
> > > > > > > > introduced, it will be more powerful to control the flink
> > memory
> > > > > > usage. I
> > > > > > > > just have few questions about it.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >    - Native and Direct Memory
> > > > > > > >
> > > > > > > > We do not differentiate user direct memory and native memory.
> > > They
> > > > > are
> > > > > > > all
> > > > > > > > included in task off-heap memory. Right? So I don't think we can
> > > > > > > > set the -XX:MaxDirectMemorySize properly. I prefer leaving it at
> > > > > > > > a very large value.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >    - Memory Calculation
> > > > > > > >
> > > > > > > > If the sum of all fine-grained memory (network memory, managed
> > > > memory,
> > > > > > > etc.)
> > > > > > > > is larger than total process memory, how do we deal with this
> > > > > > situation?
> > > > > > > Do
> > > > > > > > we need to check the memory configuration in client?
> > > > > > > >
> > > > > > > > Xintong Song <[hidden email]> wrote on Wed, Aug 7, 2019 at 10:14 PM:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > We would like to start a discussion thread on "FLIP-49:
> > Unified
> > > > > > Memory
> > > > > > > > > Configuration for TaskExecutors"[1], where we describe how
> to
> > > > > improve
> > > > > > > > > TaskExecutor memory configurations. The FLIP document is
> > mostly
> > > > > based
> > > > > > > on
> > > > > > > > an
> > > > > > > > > early design "Memory Management and Configuration
> > Reloaded"[2]
> > > by
> > > > > > > > Stephan,
> > > > > > > > > with updates from follow-up discussions both online and
> > > offline.
> > > > > > > > >
> > > > > > > > > This FLIP addresses several shortcomings of current (Flink
> > 1.9)
> > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > >
> > > > > > > > >    - Different configuration for Streaming and Batch.
> > > > > > > > >    - Complex and difficult configuration of RocksDB in
> > > Streaming.
> > > > > > > > >    - Complicated, uncertain and hard to understand.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Key changes to solve the problems can be summarized as
> > follows.
> > > > > > > > >
> > > > > > > > >    - Extend memory manager to also account for memory usage
> > by
> > > > > state
> > > > > > > > >    backends.
> > > > > > > > >    - Modify how TaskExecutor memory is partitioned
> accounted
> > > > > > individual
> > > > > > > > >    memory reservations and pools.
> > > > > > > > >    - Simplify memory configuration options and calculations
> > > > logics.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Please find more details in the FLIP wiki document [1].
> > > > > > > > >
> > > > > > > > > (Please note that the early design doc [2] is out of sync,
> > and
> > > it
> > > > > is
> > > > > > > > > appreciated to have the discussion in this mailing list
> > > thread.)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Looking forward to your feedbacks.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > >
> > > > > > > > > [2]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Stephan Ewen
About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize it a
bit differently:

We have the following two options:

(1) We let MemorySegments be de-allocated by the GC. That makes it segfault
safe. But then we need a way to trigger GC in case de-allocation and
re-allocation of a bunch of segments happens quickly, which is often the
case during batch scheduling or task restart.
  - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
  - Another way could be to have a dedicated bookkeeping in the
MemoryManager (option 1.2), so that this is a number independent of the
"-XX:MaxDirectMemorySize" parameter.

(2) We manually allocate and de-allocate the memory for the MemorySegments
(option 2). That way we need not worry about triggering GC by some
threshold or bookkeeping, but it is harder to prevent segfaults. We need to
be very careful about when we release the memory segments (only in the
cleanup phase of the main thread).
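The "release only in the cleanup phase of the main thread" discipline for option (2) could be sketched like this. The names and the owner-thread guard are hypothetical, not Flink's actual MemorySegment API; it only illustrates restricting who may free manually managed memory.

```java
// Sketch: manual de-allocation guarded so that only the owning (main task)
// thread can free a segment, shrinking the window for use-after-free
// segfaults.
public class OwnedSegment {
    private final Thread owner = Thread.currentThread();
    private long address;          // pretend native address
    private boolean released;

    OwnedSegment(long address) { this.address = address; }

    synchronized void release() {
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException("release only from owner thread");
        }
        released = true;
        address = 0;               // real code would call Unsafe.freeMemory here
    }

    synchronized boolean isReleased() { return released; }

    public static void main(String[] args) {
        OwnedSegment seg = new OwnedSegment(42L);
        seg.release();
        System.out.println("released: " + seg.isReleased());
    }
}
```

A guard like this catches accidental frees from task threads, but it cannot prevent a reader that still holds the raw address, which is why option (2) remains harder to make segfault-safe than GC-managed segments.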

If we go with option 1.1, we probably need to set
"-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory" and
have "direct_memory" as a separate reserved memory pool. Because if we just
set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + jvm_overhead",
then there will be times when that entire memory is allocated by direct
buffers and we have nothing left for the JVM overhead. So we either need a
way to compensate for that (again some safety margin cutoff value) or we
will exceed container memory.
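To make the option 1.1 sizing concrete, here is a small sketch of the budget arithmetic. All numbers are hypothetical (the 300 MB is made up; the 64 MB is only the starting default floated above); it just shows that the limit must cover both pools.

```java
// Sketch of the option 1.1 budget: -XX:MaxDirectMemorySize must cover the
// off-heap managed memory plus a dedicated direct_memory pool, so that
// direct buffers cannot eat the headroom reserved for JVM overhead.
public class DirectMemoryBudget {
    static long maxDirectOption11(long offHeapManagedMb, long directMemoryMb) {
        return offHeapManagedMb + directMemoryMb;
    }

    public static void main(String[] args) {
        long offHeapManaged = 300;  // assumed managed memory size in MB
        long directMemory = 64;     // suggested default starting point in MB
        System.out.println("-XX:MaxDirectMemorySize="
            + maxDirectOption11(offHeapManaged, directMemory) + "m");
    }
}
```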

If we go with option 1.2, we need to be aware that it takes elaborate logic
to trigger recycling of direct buffers without always forcing a full GC.


My first guess is that the options will be easiest to do in the following
order:

  - Option 1.1 with a dedicated direct_memory parameter, as discussed
above. We would need to find a way to set the direct_memory parameter by
default. We could start with 64 MB and see how it goes in practice. One
danger I see is that setting this too low can cause a bunch of additional
GCs compared to before (we need to watch this carefully).

  - Option 2. It is actually quite simple to implement; we could try how
segfault-safe we are at the moment.

  - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize" parameter
at all and assume that all the direct memory allocations that the JVM and
Netty do are infrequent enough to be cleaned up fast enough through regular
GC. I am not sure if that is a valid assumption, though.

Best,
Stephan



On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]> wrote:

> Thanks for sharing your opinion Till.
>
> I'm also in favor of alternative 2. I was wondering whether we can avoid
> using Unsafe.allocate() for off-heap managed memory and network memory with
> alternative 3. But after giving it a second thought, I think even for
> alternative 3 using direct memory for off-heap managed memory could cause
> problems.
>
> Hi Yang,
>
> Regarding your concern, I think what proposed in this FLIP it to have both
> off-heap managed memory and network memory allocated through
> Unsafe.allocate(), which means they are practically native memory and not
> limited by JVM max direct memory. The only parts of memory limited by JVM
> max direct memory are task off-heap memory and JVM overhead, which are
> exactly alternative 2 suggests to set the JVM max direct memory to.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> wrote:
>
> > Thanks for the clarification Xintong. I understand the two alternatives
> > now.
> >
> > I would be in favour of option 2 because it makes things explicit. If we
> > don't limit the direct memory, I fear that we might end up in a similar
> > situation as we are currently in: The user might see that her process
> gets
> > killed by the OS and does not know why this is the case. Consequently,
> she
> > tries to decrease the process memory size (similar to increasing the
> cutoff
> > ratio) in order to accommodate for the extra direct memory. Even worse,
> she
> > tries to decrease memory budgets which are not fully used and hence won't
> > change the overall memory consumption.
> >
> > Cheers,
> > Till
> >
> > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <[hidden email]>
> > wrote:
> >
> > > Let me explain this with a concrete example Till.
> > >
> > > Let's say we have the following scenario.
> > >
> > > Total Process Memory: 1GB
> > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory
> and
> > > Network Memory): 800MB
> > >
> > >
> > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> value,
> > > let's say 1TB.
> > >
> > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > Overhead
> > > do not exceed 200MB, then alternative 2 and alternative 3 should have
> the
> > > same utility. Setting larger -XX:MaxDirectMemorySize will not reduce
> the
> > > sizes of the other memory pools.
> > >
> > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > Overhead potentially exceed 200MB, then
> > >
> > >    - Alternative 2 suffers from frequent OOM. To avoid that, the only
> > thing
> > >    user can do is to modify the configuration and increase JVM Direct
> > > Memory
> > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user increases
> > JVM
> > >    Direct Memory to 250MB, this will reduce the total size of other
> > memory
> > >    pools to 750MB, given the total process memory remains 1GB.
> > >    - For alternative 3, there is no chance of direct OOM. There are
> > chances
> > >    of exceeding the total process memory limit, but given that the
> > process
> > > may
> > >    not use up all the reserved native memory (Off-Heap Managed Memory,
> > > Network
> > >    Memory, JVM Metaspace), if the actual direct memory usage is
> slightly
> > > above
> > >    yet very close to 200MB, user probably do not need to change the
> > >    configurations.
> > >
> > > Therefore, I think from the user's perspective, a feasible
> configuration
> > > for alternative 2 may lead to lower resource utilization compared to
> > > alternative 3.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > > > I guess you have to help me understand the difference between
> > > alternative 2
> > > > and 3 wrt to memory under utilization Xintong.
> > > >
> > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap Memory
> and
> > > JVM
> > > > Overhead. Then there is the risk that this size is too low resulting
> > in a
> > > > lot of garbage collection and potentially an OOM.
> > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger than
> > > > alternative 2. This would of course reduce the sizes of the other
> > memory
> > > > types.
> > > >
> > > > How would alternative 2 now result in an under utilization of memory
> > > > compared to alternative 3? If alternative 3 strictly sets a higher
> max
> > > > direct memory size and we use only little, then I would expect that
> > > > alternative 3 results in memory under utilization.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <[hidden email]>
> > wrote:
> > > >
> > > > > Hi xintong,till
> > > > >
> > > > >
> > > > > > Native and Direct Memory
> > > > >
> > > > > My point is setting a very large max direct memory size when we do
> > not
> > > > > differentiate direct and native memory. If the direct
> > memory,including
> > > > user
> > > > > direct memory and framework direct memory,could be calculated
> > > > > correctly,then
> > > > > i am in favor of setting direct memory with fixed value.
> > > > >
> > > > >
> > > > >
> > > > > > Memory Calculation
> > > > >
> > > > > I agree with xintong. For Yarn and k8s,we need to check the memory
> > > > > configurations in client to avoid submitting successfully and
> failing
> > > in
> > > > > the flink master.
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > > Yang
> > > > >
> > > > > Xintong Song <[hidden email]>于2019年8月13日 周二22:07写道:
> > > > >
> > > > > > Thanks for replying, Till.
> > > > > >
> > > > > > About MemorySegment, I think you are right that we should not
> > include
> > > > > this
> > > > > > issue in the scope of this FLIP. This FLIP should concentrate on
> > how
> > > to
> > > > > > configure memory pools for TaskExecutors, with minimum
> involvement
> > on
> > > > how
> > > > > > memory consumers use it.
> > > > > >
> > > > > > About direct memory, I think alternative 3 may not having the
> same
> > > over
> > > > > > reservation issue that alternative 2 does, but at the cost of
> risk
> > of
> > > > > over
> > > > > > using memory at the container level, which is not good. My point
> is
> > > > that
> > > > > > both "Task Off-Heap Memory" and "JVM Overhead" are not easy to
> > > config.
> > > > > For
> > > > > > alternative 2, users might configure them higher than what
> actually
> > > > > needed,
> > > > > > just to avoid getting a direct OOM. For alternative 3, users do
> not
> > > get
> > > > > > direct OOM, so they may not config the two options aggressively
> > high.
> > > > But
> > > > > > the consequences are risks of overall container memory usage
> > exceeds
> > > > the
> > > > > > budget.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > >
> > > > > > > All in all I think it already looks quite good. Concerning the
> > > first
> > > > > open
> > > > > > > question about allocating memory segments, I was wondering
> > whether
> > > > this
> > > > > > is
> > > > > > > strictly necessary to do in the context of this FLIP or whether
> > > this
> > > > > > could
> > > > > > > be done as a follow up? Without knowing all details, I would be
> > > > > concerned
> > > > > > > that we would widen the scope of this FLIP too much because we
> > > would
> > > > > have
> > > > > > > to touch all the existing call sites of the MemoryManager where
> > we
> > > > > > allocate
> > > > > > > memory segments (this should mainly be batch operators). The
> > > addition
> > > > > of
> > > > > > > the memory reservation call to the MemoryManager should not be
> > > > affected
> > > > > > by
> > > > > > > this and I would hope that this is the only point of
> interaction
> > a
> > > > > > > streaming job would have with the MemoryManager.
> > > > > > >
> > > > > > > Concerning the second open question about setting or not
> setting
> > a
> > > > max
> > > > > > > direct memory limit, I would also be interested why Yang Wang
> > > thinks
> > > > > > > leaving it open would be best. My concern about this would be
> > that
> > > we
> > > > > > would
> > > > > > > be in a similar situation as we are now with the
> > > RocksDBStateBackend.
> > > > > If
> > > > > > > the different memory pools are not clearly separated and can
> > spill
> > > > over
> > > > > > to
> > > > > > > a different pool, then it is quite hard to understand what
> > exactly
> > > > > > causes a
> > > > > > > process to get killed for using too much memory. This could
> then
> > > > easily
> > > > > > > lead to a similar situation what we have with the cutoff-ratio.
> > So
> > > > why
> > > > > > not
> > > > > > > setting a sane default value for max direct memory and giving
> the
> > > > user
> > > > > an
> > > > > > > option to increase it if he runs into an OOM.
> > > > > > >
> > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > utilization
> > > > than
> > > > > > > alternative 3 where we set the direct memory to a higher value?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback, Yang.
> > > > > > > >
> > > > > > > > Regarding your comments:
> > > > > > > >
> > > > > > > > *Native and Direct Memory*
> > > > > > > > I think setting a very large max direct memory size
> definitely
> > > has
> > > > > some
> > > > > > > > good sides. E.g., we do not worry about direct OOM, and we
> > don't
> > > > even
> > > > > > > need
> > > > > > > > to allocate managed / network memory with Unsafe.allocate() .
> > > > > > > > However, there are also some down sides of doing this.
> > > > > > > >
> > > > > > > >    - One thing I can think of is that if a task executor
> > > container
> > > > is
> > > > > > > >    killed due to overusing memory, it could be hard for use
> to
> > > know
> > > > > > which
> > > > > > > > part
> > > > > > > >    of the memory is overused.
> > > > > > > >    - Another down side is that the JVM never trigger GC due
> to
> > > > > reaching
> > > > > > > max
> > > > > > > >    direct memory limit, because the limit is too high to be
> > > > reached.
> > > > > > That
> > > > > > > >    means we kind of relay on heap memory to trigger GC and
> > > release
> > > > > > direct
> > > > > > > >    memory. That could be a problem in cases where we have
> more
> > > > direct
> > > > > > > > memory
> > > > > > > >    usage but not enough heap activity to trigger the GC.
> > > > > > > >
> > > > > > > > Maybe you can share your reasons for preferring setting a
> very
> > > > large
> > > > > > > value,
> > > > > > > > if there are anything else I overlooked.
> > > > > > > >
> > > > > > > > *Memory Calculation*
> > > > > > > > If there is any conflict between multiple configuration that
> > user
> > > > > > > > explicitly specified, I think we should throw an error.
> > > > > > > > I think doing checking on the client side is a good idea, so
> > that
> > > > on
> > > > > > > Yarn /
> > > > > > > > K8s we can discover the problem before submitting the Flink
> > > > cluster,
> > > > > > > which
> > > > > > > > is always a good thing.
> > > > > > > > But we can not only rely on the client side checking, because
> > for
> > > > > > > > standalone cluster TaskManagers on different machines may
> have
> > > > > > different
> > > > > > > > configurations and the client does see that.
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > [hidden email]>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi xintong,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks for your detailed proposal. After all the memory
> > > > > configurations
> > > > > > > are
> > > > > > > > > introduced, it will be more powerful to control the flink
> > > memory
> > > > > > > usage. I
> > > > > > > > > just have few questions about it.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    - Native and Direct Memory
> > > > > > > > >
> > > > > > > > > We do not differentiate user direct memory and native
> memory.
> > > > They
> > > > > > are
> > > > > > > > all
> > > > > > > > > included in task off-heap memory. Right? So I don't think
> we
> > > > could
> > > > set
> > > > > > > > > the -XX:MaxDirectMemorySize properly. I prefer leaving it a
> > > very
> > > > > > large
> > > > > > > > > value.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    - Memory Calculation
> > > > > > > > >
> > > > > > > > > If the sum of the fine-grained memory (network memory,
> managed
> > > > > memory,
> > > > > > > > etc.)
> > > > > > > > > is larger than total process memory, how do we deal with
> this
> > > > > > > situation?
> > > > > > > > Do
> > > > > > > > > we need to check the memory configuration in client?
> > > > > > > > >
> > > > > > > > > Xintong Song <[hidden email]> 于2019年8月7日周三
> 下午10:14写道:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > We would like to start a discussion thread on "FLIP-49:
> > > Unified
> > > > > > > Memory
> > > > > > > > > > Configuration for TaskExecutors"[1], where we describe
> how
> > to
> > > > > > improve
> > > > > > > > > > TaskExecutor memory configurations. The FLIP document is
> > > mostly
> > > > > > based
> > > > > > > > on
> > > > > > > > > an
> > > > > > > > > > early design "Memory Management and Configuration
> > > Reloaded"[2]
> > > > by
> > > > > > > > > Stephan,
> > > > > > > > > > with updates from follow-up discussions both online and
> > > > offline.
> > > > > > > > > >
> > > > > > > > > > This FLIP addresses several shortcomings of current
> (Flink
> > > 1.9)
> > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > >
> > > > > > > > > >    - Different configuration for Streaming and Batch.
> > > > > > > > > >    - Complex and difficult configuration of RocksDB in
> > > > Streaming.
> > > > > > > > > >    - Complicated, uncertain and hard to understand.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Key changes to solve the problems can be summarized as
> > > follows.
> > > > > > > > > >
> > > > > > > > > >    - Extend memory manager to also account for memory
> usage
> > > by
> > > > > > state
> > > > > > > > > >    backends.
> > > > > > > > > >    - Modify how TaskExecutor memory is partitioned
> > into
> > > > > > > individual
> > > > > > > > > >    memory reservations and pools.
> > > > > > > > > >    - Simplify memory configuration options and
> calculation
> > > > > logic.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Please find more details in the FLIP wiki document [1].
> > > > > > > > > >
> > > > > > > > > > (Please note that the early design doc [2] is out of
> sync,
> > > and
> > > > it
> > > > > > is
> > > > > > > > > > appreciated to have the discussion in this mailing list
> > > > thread.)
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > >
> > > > > > > > > > [2]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >


On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]> wrote:

> Thanks for sharing your opinion Till.
>
> I'm also in favor of alternative 2. I was wondering whether we can avoid
> using Unsafe.allocate() for off-heap managed memory and network memory with
> alternative 3. But after giving it a second thought, I think even for
> alternative 3 using direct memory for off-heap managed memory could cause
> problems.
>
> Hi Yang,
>
> Regarding your concern, I think what is proposed in this FLIP is to have both
> off-heap managed memory and network memory allocated through
> Unsafe.allocate(), which means they are practically native memory and not
> limited by JVM max direct memory. The only parts of memory limited by JVM
> max direct memory are task off-heap memory and JVM overhead, which is
> exactly what alternative 2 suggests setting the JVM max direct memory to.
>
> Thank you~
>
> Xintong Song
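
For reference, the "Unsafe.allocate()" mentioned above corresponds to the JDK-internal sun.misc.Unsafe.allocateMemory(): memory obtained this way is plain native memory, invisible to both the GC and -XX:MaxDirectMemorySize, so it must be freed explicitly. A minimal sketch (Unsafe is obtained via reflection here, purely for illustration; this is unsupported JDK-internal API):

```java
import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class NativeAlloc {

    // sun.misc.Unsafe is not constructible directly; grab the singleton field.
    static Unsafe unsafe() throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    }

    public static void main(String[] args) throws Exception {
        Unsafe u = unsafe();
        long address = u.allocateMemory(64); // not counted against MaxDirectMemorySize
        try {
            u.putByte(address, (byte) 7);
            System.out.println(u.getByte(address)); // prints 7
        } finally {
            u.freeMemory(address); // no GC will ever do this for us
        }
    }
}
```

This is exactly why such memory is "practically native": the JVM's direct memory accounting never sees these allocations.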
>
>
>
> On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> wrote:
>
> > Thanks for the clarification Xintong. I understand the two alternatives
> > now.
> >
> > I would be in favour of option 2 because it makes things explicit. If we
> > don't limit the direct memory, I fear that we might end up in a similar
> > situation as we are currently in: The user might see that her process
> gets
> > killed by the OS and does not know why this is the case. Consequently,
> she
> > tries to decrease the process memory size (similar to increasing the
> cutoff
> > ratio) in order to accommodate for the extra direct memory. Even worse,
> she
> > tries to decrease memory budgets which are not fully used and hence won't
> > change the overall memory consumption.
> >
> > Cheers,
> > Till
> >
> > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <[hidden email]>
> > wrote:
> >
> > > Let me explain this with a concrete example Till.
> > >
> > > Let's say we have the following scenario.
> > >
> > > Total Process Memory: 1GB
> > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory
> and
> > > Network Memory): 800MB
> > >
> > >
> > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> value,
> > > let's say 1TB.
> > >
> > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > Overhead
> > > does not exceed 200MB, then alternative 2 and alternative 3 should have
> the
> > > same utility. Setting larger -XX:MaxDirectMemorySize will not reduce
> the
> > > sizes of the other memory pools.
> > >
> > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > Overhead potentially exceeds 200MB, then
> > >
> > >    - Alternative 2 suffers from frequent OOM. To avoid that, the only
> > thing
> > >    user can do is to modify the configuration and increase JVM Direct
> > > Memory
> > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user increases
> > JVM
> > >    Direct Memory to 250MB, this will reduce the total size of other
> > memory
> > >    pools to 750MB, given the total process memory remains 1GB.
> > >    - For alternative 3, there is no chance of direct OOM. There are
> > chances
> > >    of exceeding the total process memory limit, but given that the
> > process
> > > may
> > >    not use up all the reserved native memory (Off-Heap Managed Memory,
> > > Network
> > >    Memory, JVM Metaspace), if the actual direct memory usage is
> slightly
> > > above
>    yet very close to 200MB, users probably do not need to change the
> > >    configurations.
> > >
> > > Therefore, I think from the user's perspective, a feasible
> configuration
> > > for alternative 2 may lead to lower resource utilization compared to
> > > alternative 3.
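
The arithmetic in this example can be written down as a tiny sketch (the 120/80 MB split of the 200 MB direct budget is an assumption made purely for illustration):

```java
// A worked version of the numbers in the example above, in MB.
public class MemoryBudget {

    /** Alternative 2: -XX:MaxDirectMemorySize = task off-heap + JVM overhead. */
    static long maxDirectMb(long taskOffHeapMb, long jvmOverheadMb) {
        return taskOffHeapMb + jvmOverheadMb;
    }

    /** What remains for heap, metaspace, off-heap managed and network memory. */
    static long otherPoolsMb(long totalProcessMb, long jvmDirectMb) {
        return totalProcessMb - jvmDirectMb;
    }

    public static void main(String[] args) {
        long total = 1000;                  // ~1 GB, as in the example
        long direct = maxDirectMb(120, 80); // assumed split of the 200 MB budget
        System.out.println(direct);                      // 200
        System.out.println(otherPoolsMb(total, direct)); // 800
        // Growing the direct budget to 250 MB shrinks the other pools to 750 MB,
        // which is the under-utilization effect described above.
        System.out.println(otherPoolsMb(total, maxDirectMb(150, 100))); // 750
    }
}
```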
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > > > I guess you have to help me understand the difference between
> > > alternative 2
> > > > and 3 wrt to memory under utilization Xintong.
> > > >
> > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap Memory
> and
> > > JVM
> > > > Overhead. Then there is the risk that this size is too low resulting
> > in a
> > > > lot of garbage collection and potentially an OOM.
> > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger than
> > > > alternative 2. This would of course reduce the sizes of the other
> > memory
> > > > types.
> > > >
> > > > How would alternative 2 now result in an under utilization of memory
> > > > compared to alternative 3? If alternative 3 strictly sets a higher
> max
> > > > direct memory size and we use only little, then I would expect that
> > > > alternative 3 results in memory under utilization.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <[hidden email]>
> > wrote:
> > > >
> > > > > Hi Xintong, Till,
> > > > >
> > > > >
> > > > > > Native and Direct Memory
> > > > >
> > > > > My point is to set a very large max direct memory size when we do
> > not
> > > > > differentiate direct and native memory. If the direct
> > memory, including
> > > > user
> > > > > direct memory and framework direct memory, could be calculated
> > > > > correctly, then
> > > > > I am in favor of setting direct memory with a fixed value.
> > > > >
> > > > >
> > > > >
> > > > > > Memory Calculation
> > > > >
> > > > > I agree with xintong. For Yarn and k8s,we need to check the memory
> > > > > configurations in client to avoid submitting successfully and
> failing
> > > in
> > > > > the flink master.
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > > Yang
> > > > >
> > > > > Xintong Song <[hidden email]>于2019年8月13日 周二22:07写道:
> > > > >
> > > > > > Thanks for replying, Till.
> > > > > >
> > > > > > About MemorySegment, I think you are right that we should not
> > include
> > > > > this
> > > > > > issue in the scope of this FLIP. This FLIP should concentrate on
> > how
> > > to
> > > > > > configure memory pools for TaskExecutors, with minimum
> involvement
> > on
> > > > how
> > > > > > memory consumers use it.
> > > > > >
> > > > > > About direct memory, I think alternative 3 may not have the
> same
> > > over
> > > > > > reservation issue that alternative 2 does, but at the cost of
> risk
> > of
> > > > > over
> > > > > > using memory at the container level, which is not good. My point
> is
> > > > that
> > > > > > both "Task Off-Heap Memory" and "JVM Overhead" are not easy to
> > > config.
> > > > > For
> > > > > > alternative 2, users might configure them higher than what
> actually
> > > > > needed,
> > > > > > just to avoid getting a direct OOM. For alternative 3, users do
> not
> > > get
> > > > > > direct OOM, so they may not config the two options aggressively
> > high.
> > > > But
> > > > > > the consequence is the risk that overall container memory usage
> > exceeds
> > > > the
> > > > > > budget.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > >
> > > > > > > All in all I think it already looks quite good. Concerning the
> > > first
> > > > > open
> > > > > > > question about allocating memory segments, I was wondering
> > whether
> > > > this
> > > > > > is
> > > > > > > strictly necessary to do in the context of this FLIP or whether
> > > this
> > > > > > could
> > > > > > > be done as a follow up? Without knowing all details, I would be
> > > > > concerned
> > > > > > > that we would widen the scope of this FLIP too much because we
> > > would
> > > > > have
> > > > > > > to touch all the existing call sites of the MemoryManager where
> > we
> > > > > > allocate
> > > > > > > memory segments (this should mainly be batch operators). The
> > > addition
> > > > > of
> > > > > > > the memory reservation call to the MemoryManager should not be
> > > > affected
> > > > > > by
> > > > > > > this and I would hope that this is the only point of
> interaction
> > a
> > > > > > > streaming job would have with the MemoryManager.
> > > > > > >
> > > > > > > Concerning the second open question about setting or not
> setting
> > a
> > > > max
> > > > > > > direct memory limit, I would also be interested why Yang Wang
> > > thinks
> > > > > > > leaving it open would be best. My concern about this would be
> > that
> > > we
> > > > > > would
> > > > > > > be in a similar situation as we are now with the
> > > RocksDBStateBackend.
> > > > > If
> > > > > > > the different memory pools are not clearly separated and can
> > spill
> > > > over
> > > > > > to
> > > > > > > a different pool, then it is quite hard to understand what
> > exactly
> > > > > > causes a
> > > > > > > process to get killed for using too much memory. This could
> then
> > > > easily
> > > > > > > lead to a similar situation to what we have with the cutoff-ratio.
> > So
> > > > why
> > > > > > not
> > > > > > > set a sane default value for max direct memory and give
> the
> > > > user
> > > > > an
> > > > > > > option to increase it if he runs into an OOM.
> > > > > > >
> > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > utilization
> > > > than
> > > > > > > alternative 3 where we set the direct memory to a higher value?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback, Yang.
> > > > > > > >
> > > > > > > > Regarding your comments:
> > > > > > > >
> > > > > > > > *Native and Direct Memory*
> > > > > > > > I think setting a very large max direct memory size
> definitely
> > > has
> > > > > some
> > > > > > > > good sides. E.g., we do not worry about direct OOM, and we
> > don't
> > > > even
> > > > > > > need
> > > > > > > > to allocate managed / network memory with Unsafe.allocate() .
> > > > > > > > However, there are also some down sides of doing this.
> > > > > > > >
> > > > > > > >    - One thing I can think of is that if a task executor
> > > container
> > > > is
> > > > > > > >    killed due to overusing memory, it could be hard for us
> to
> > > know
> > > > > > which
> > > > > > > > part
> > > > > > > >    of the memory is overused.
> > > > > > > >    - Another down side is that the JVM never triggers GC due
> to
> > > > > reaching
> > > > > > > max
> > > > > > > >    direct memory limit, because the limit is too high to be
> > > > reached.
> > > > > > That
> > > > > > > >    means we kind of rely on heap memory to trigger GC and
> > > release
> > > > > > direct
> > > > > > > >    memory. That could be a problem in cases where we have
> more
> > > > direct
> > > > > > > > memory
> > > > > > > >    usage but not enough heap activity to trigger the GC.
> > > > > > > >
> > > > > > > > Maybe you can share your reasons for preferring setting a
> very
> > > > large
> > > > > > > value,
> > > > > > > > if there is anything else I overlooked.
> > > > > > > >
> > > > > > > > *Memory Calculation*
> > > > > > > > If there is any conflict between multiple configurations that
> > user
> > > > > > > > explicitly specified, I think we should throw an error.
> > > > > > > > I think doing checking on the client side is a good idea, so
> > that
> > > > on
> > > > > > > Yarn /
> > > > > > > > K8s we can discover the problem before submitting the Flink
> > > > cluster,
> > > > > > > which
> > > > > > > > is always a good thing.
> > > > > > > > But we can not only rely on the client side checking, because
> > for
> > > > > > > > standalone cluster TaskManagers on different machines may
> have
> > > > > > different
> > > > > > > > configurations and the client does not see that.
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > [hidden email]>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi xintong,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks for your detailed proposal. After all the memory
> > > > > configurations
> > > > > > > are
> > > > > > > > > introduced, it will be more powerful to control the flink
> > > memory
> > > > > > > usage. I
> > > > > > > > > just have few questions about it.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    - Native and Direct Memory
> > > > > > > > >
> > > > > > > > > We do not differentiate user direct memory and native
> memory.
> > > > They
> > > > > > are
> > > > > > > > all
> > > > > > > > > included in task off-heap memory. Right? So I don't think
> we
> > > > could
> > > > > > > > set
> > > > > > > > > the -XX:MaxDirectMemorySize properly. I prefer leaving it a
> > > very
> > > > > > large
> > > > > > > > > value.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    - Memory Calculation
> > > > > > > > >
> > > > > > > > > If the sum of the fine-grained memory (network memory,
> managed
> > > > > memory,
> > > > > > > > etc.)
> > > > > > > > > is larger than total process memory, how do we deal with
> this
> > > > > > > situation?
> > > > > > > > Do
> > > > > > > > > we need to check the memory configuration in client?
> > > > > > > > >
> > > > > > > > > Xintong Song <[hidden email]> 于2019年8月7日周三
> 下午10:14写道:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > We would like to start a discussion thread on "FLIP-49:
> > > Unified
> > > > > > > Memory
> > > > > > > > > > Configuration for TaskExecutors"[1], where we describe
> how
> > to
> > > > > > improve
> > > > > > > > > > TaskExecutor memory configurations. The FLIP document is
> > > mostly
> > > > > > based
> > > > > > > > on
> > > > > > > > > an
> > > > > > > > > > early design "Memory Management and Configuration
> > > Reloaded"[2]
> > > > by
> > > > > > > > > Stephan,
> > > > > > > > > > with updates from follow-up discussions both online and
> > > > offline.
> > > > > > > > > >
> > > > > > > > > > This FLIP addresses several shortcomings of current
> (Flink
> > > 1.9)
> > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > >
> > > > > > > > > >    - Different configuration for Streaming and Batch.
> > > > > > > > > >    - Complex and difficult configuration of RocksDB in
> > > > Streaming.
> > > > > > > > > >    - Complicated, uncertain and hard to understand.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Key changes to solve the problems can be summarized as
> > > follows.
> > > > > > > > > >
> > > > > > > > > >    - Extend memory manager to also account for memory
> usage
> > > by
> > > > > > state
> > > > > > > > > >    backends.
> > > > > > > > > >    - Modify how TaskExecutor memory is partitioned
> > into
> > > > > > > individual
> > > > > > > > > >    memory reservations and pools.
> > > > > > > > > >    - Simplify memory configuration options and
> calculation
> > > > > logic.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Please find more details in the FLIP wiki document [1].
> > > > > > > > > >
> > > > > > > > > > (Please note that the early design doc [2] is out of
> sync,
> > > and
> > > > it
> > > > > > is
> > > > > > > > > > appreciated to have the discussion in this mailing list
> > > > thread.)
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > >
> > > > > > > > > > [2]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Xintong Song
Thanks for the comments, Stephan. Summarizing it this way really makes
things easier to understand.

I'm in favor of option 2, at least for the moment. I think it is not that
difficult to keep the memory manager segfault safe, as long as we always
de-allocate a memory segment only when it is released by its memory
consumers. The only risk is a memory consumer that continues using the
buffer of a memory segment after releasing it, in which case we do want
the job to fail so that we detect the memory leak early.

For option 1.2, I don't think this is a good idea. Not only may the
assumption (regular GC is enough to clean direct buffers) not always hold,
it also makes it harder to find problems in cases of memory overuse. E.g.,
say the user configured some direct memory for the user libraries. If a
library actually uses more direct memory than configured, and that memory
cannot be cleaned by GC because it is still in use, this may lead to
overuse of the total container memory. In that case, as long as we do not
hit the JVM's default max direct memory limit, we never get a direct
memory OOM, and it becomes very hard to understand which part of the
configuration needs to be updated.

Option 1.1 has a similar problem to 1.2 whenever the excess direct memory
stays below the max direct memory limit specified by the dedicated
parameter. I think it is slightly better than 1.2 only because we can tune
the parameter.

Thank you~

Xintong Song
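
The fail-fast behavior described above can be sketched as follows. This is a simplified, hypothetical stand-in: a plain byte[] plays the role of the Unsafe-allocated native buffer, so that use-after-release throws an exception instead of segfaulting.

```java
public class GuardedSegmentDemo {

    /** Simplified stand-in for a memory segment (hypothetical class). */
    static final class GuardedSegment {
        private byte[] memory; // plays the role of natively allocated memory

        GuardedSegment(int size) {
            this.memory = new byte[size];
        }

        void put(int index, byte value) {
            checkNotReleased();
            memory[index] = value;
        }

        byte get(int index) {
            checkNotReleased();
            return memory[index];
        }

        /** Called by the memory manager once the consumer releases the segment. */
        void free() {
            memory = null; // de-allocate eagerly, not via GC
        }

        private void checkNotReleased() {
            if (memory == null) {
                // Fail fast so the leak is detected early, as argued above.
                throw new IllegalStateException("segment already released");
            }
        }
    }

    /** Returns true iff use-after-release fails as intended. */
    static boolean failsAfterRelease() {
        GuardedSegment segment = new GuardedSegment(8);
        segment.put(0, (byte) 42);
        segment.free();
        try {
            segment.get(0);
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsAfterRelease()); // prints true
    }
}
```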



On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <[hidden email]> wrote:

> About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize it a
> bit differently:
>
> We have the following two options:
>
> (1) We let MemorySegments be de-allocated by the GC. That makes it segfault
> safe. But then we need a way to trigger GC in case de-allocation and
> re-allocation of a bunch of segments happens quickly, which is often the
> case during batch scheduling or task restart.
>   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
>   - Another way could be to have a dedicated bookkeeping in the
> MemoryManager (option 1.2), so that this is a number independent of the
> "-XX:MaxDirectMemorySize" parameter.
>
> (2) We manually allocate and de-allocate the memory for the MemorySegments
> (option 2). That way we need not worry about triggering GC by some
> threshold or bookkeeping, but it is harder to prevent segfaults. We need to
> be very careful about when we release the memory segments (only in the
> cleanup phase of the main thread).
>
> If we go with option 1.1, we probably need to set
> "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory" and
> have "direct_memory" as a separate reserved memory pool. Because if we just
> set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + jvm_overhead",
> then there will be times when that entire memory is allocated by direct
> buffers and we have nothing left for the JVM overhead. So we either need a
> way to compensate for that (again some safety margin cutoff value) or we
> will exceed container memory.
>
> If we go with option 1.2, we need to be aware that it takes elaborate logic
> to push recycling of direct buffers without always triggering a full GC.
>
>
> My first guess is that the options will be easiest to do in the following
> order:
>
>   - Option 1.1 with a dedicated direct_memory parameter, as discussed
> above. We would need to find a way to set the direct_memory parameter by
> default. We could start with 64 MB and see how it goes in practice. One
> danger I see is that setting this too low can cause a bunch of additional
> GCs compared to before (we need to watch this carefully).
>
>   - Option 2. It is actually quite simple to implement, we could try how
> segfault safe we are at the moment.
>
>   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize" parameter
> at all and assume that all the direct memory allocations that the JVM and
> Netty do are infrequent enough to be cleaned up fast enough through regular
> GC. I am not sure if that is a valid assumption, though.
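>

The bookkeeping of option 1.2 could be sketched as below (hypothetical names, a single GC-and-retry attempt; loosely modeled on the reserve/GC/retry pattern the JDK itself applies to direct buffers):

```java
import java.util.concurrent.atomic.AtomicLong;

// A sketch of option 1.2: the manager tracks reserved direct memory itself,
// independent of -XX:MaxDirectMemorySize, and nudges the GC when a
// reservation would exceed its own budget. Names are hypothetical.
public class DirectReservation {
    private final long budget;
    private final AtomicLong reserved = new AtomicLong();
    private final Runnable gcTrigger; // in practice something like System::gc

    DirectReservation(long budget, Runnable gcTrigger) {
        this.budget = budget;
        this.gcTrigger = gcTrigger;
    }

    /** Book-keep a direct allocation; nudge the GC once before giving up. */
    boolean tryReserve(long bytes) {
        for (int attempt = 0; attempt < 2; attempt++) {
            long current = reserved.get();
            if (current + bytes <= budget
                    && reserved.compareAndSet(current, current + bytes)) {
                return true;
            }
            gcTrigger.run(); // hope that released segments are returned, retry
        }
        return false; // the caller would surface an OOM-style error here
    }

    void release(long bytes) {
        reserved.addAndGet(-bytes);
    }
}
```

The hard part Stephan mentions is exactly the `gcTrigger`: making it free direct buffers reliably without forcing a full GC every time.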
>
> Best,
> Stephan
>
>
>
> On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> wrote:
>
> > Thanks for sharing your opinion Till.
> >
> > I'm also in favor of alternative 2. I was wondering whether we can avoid
> > using Unsafe.allocate() for off-heap managed memory and network memory
> with
> > alternative 3. But after giving it a second thought, I think even for
> > alternative 3 using direct memory for off-heap managed memory could cause
> > problems.
> >
> > Hi Yang,
> >
> > Regarding your concern, I think what is proposed in this FLIP is to have
> both
> > off-heap managed memory and network memory allocated through
> > Unsafe.allocate(), which means they are practically native memory and not
> > limited by JVM max direct memory. The only parts of memory limited by JVM
> > max direct memory are task off-heap memory and JVM overhead, which is
> > exactly what alternative 2 suggests setting the JVM max direct memory to.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > Thanks for the clarification Xintong. I understand the two alternatives
> > > now.
> > >
> > > I would be in favour of option 2 because it makes things explicit. If
> we
> > > don't limit the direct memory, I fear that we might end up in a similar
> > > situation as we are currently in: The user might see that her process
> > gets
> > > killed by the OS and does not know why this is the case. Consequently,
> > she
> > > tries to decrease the process memory size (similar to increasing the
> > cutoff
> > > ratio) in order to accommodate for the extra direct memory. Even worse,
> > she
> > > tries to decrease memory budgets which are not fully used and hence
> won't
> > > change the overall memory consumption.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > > > Let me explain this with a concrete example Till.
> > > >
> > > > Let's say we have the following scenario.
> > > >
> > > > Total Process Memory: 1GB
> > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory
> > and
> > > > Network Memory): 800MB
> > > >
> > > >
> > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > value,
> > > > let's say 1TB.
> > > >
> > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > Overhead
> > > > does not exceed 200MB, then alternative 2 and alternative 3 should have
> > the
> > > > same utility. Setting larger -XX:MaxDirectMemorySize will not reduce
> > the
> > > > sizes of the other memory pools.
> > > >
> > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > Overhead potentially exceeds 200MB, then
> > > >
> > > >    - Alternative 2 suffers from frequent OOM. To avoid that, the only
> > > thing
> > > >    user can do is to modify the configuration and increase JVM Direct
> > > > Memory
> > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user
> increases
> > > JVM
> > > >    Direct Memory to 250MB, this will reduce the total size of other
> > > memory
> > > >    pools to 750MB, given the total process memory remains 1GB.
> > > >    - For alternative 3, there is no chance of direct OOM. There are
> > > chances
> > > >    of exceeding the total process memory limit, but given that the
> > > process
> > > > may
> > > >    not use up all the reserved native memory (Off-Heap Managed
> Memory,
> > > > Network
> > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > slightly
> > > > above
>    yet very close to 200MB, users probably do not need to change the
> > > >    configurations.
> > > >
> > > > Therefore, I think from the user's perspective, a feasible
> > configuration
> > > > for alternative 2 may lead to lower resource utilization compared to
> > > > alternative 3.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <[hidden email]
> >
> > > > wrote:
> > > >
> > > > > I guess you have to help me understand the difference between
> > > > alternative 2
> > > > > and 3 wrt to memory under utilization Xintong.
> > > > >
> > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap Memory
> > and
> > > > JVM
> > > > > Overhead. Then there is the risk that this size is too low
> resulting
> > > in a
> > > > > lot of garbage collection and potentially an OOM.
> > > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger
> than
> > > > > alternative 2. This would of course reduce the sizes of the other
> > > memory
> > > > > types.
> > > > >
> > > > > How would alternative 2 now result in an under utilization of
> memory
> > > > > compared to alternative 3? If alternative 3 strictly sets a higher
> > max
> > > > > direct memory size and we use only little, then I would expect that
> > > > > alternative 3 results in memory under utilization.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <[hidden email]>
> > > wrote:
> > > > >
> > > > > > Hi Xintong, Till,
> > > > > >
> > > > > >
> > > > > > > Native and Direct Memory
> > > > > >
> > > > > > My point is to set a very large max direct memory size when we
> do
> > > not
> > > > > > differentiate direct and native memory. If the direct
> > > memory, including
> > > > > user
> > > > > > direct memory and framework direct memory, could be calculated
> > > > > > correctly, then
> > > > > > I am in favor of setting direct memory with a fixed value.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Memory Calculation
> > > > > >
> > > > > > I agree with xintong. For Yarn and k8s,we need to check the
> memory
> > > > > > configurations in client to avoid submitting successfully and
> > failing
> > > > in
> > > > > > the flink master.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Yang
> > > > > >
> > > > > > Xintong Song <[hidden email]>于2019年8月13日 周二22:07写道:
> > > > > >
> > > > > > > Thanks for replying, Till.
> > > > > > >
> > > > > > > About MemorySegment, I think you are right that we should not
> > > include
> > > > > > this
> > > > > > > issue in the scope of this FLIP. This FLIP should concentrate
> on
> > > how
> > > > to
> > > > > > > configure memory pools for TaskExecutors, with minimum
> > involvement
> > > on
> > > > > how
> > > > > > > memory consumers use it.
> > > > > > >
> > > > > > > About direct memory, I think alternative 3 may not have the
> > same
> > > > over
> > > > > > > reservation issue that alternative 2 does, but at the cost of
> > risk
> > > of
> > > > > > over
> > > > > > > using memory at the container level, which is not good. My
> point
> > is
> > > > > that
> > > > > > > both "Task Off-Heap Memory" and "JVM Overhead" are not easy to
> > > > config.
> > > > > > For
> > > > > > > alternative 2, users might configure them higher than what
> > actually
> > > > > > needed,
> > > > > > > just to avoid getting a direct OOM. For alternative 3, users do
> > not
> > > > get
> > > > > > > direct OOM, so they may not config the two options aggressively
> > > high.
> > > > > But
> > > > > > > the consequence is a risk that the overall container memory
> > > > > > > usage exceeds the budget.
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > >
> > > > > > > > All in all I think it already looks quite good. Concerning
> the
> > > > first
> > > > > > open
> > > > > > > > question about allocating memory segments, I was wondering
> > > whether
> > > > > this
> > > > > > > is
> > > > > > > > strictly necessary to do in the context of this FLIP or
> whether
> > > > this
> > > > > > > could
> > > > > > > > be done as a follow up? Without knowing all details, I would
> be
> > > > > > concerned
> > > > > > > > that we would widen the scope of this FLIP too much because
> we
> > > > would
> > > > > > have
> > > > > > > > to touch all the existing call sites of the MemoryManager
> where
> > > we
> > > > > > > allocate
> > > > > > > > memory segments (this should mainly be batch operators). The
> > > > addition
> > > > > > of
> > > > > > > > the memory reservation call to the MemoryManager should not
> be
> > > > > affected
> > > > > > > by
> > > > > > > > this and I would hope that this is the only point of
> > interaction
> > > a
> > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > >
> > > > > > > > Concerning the second open question about setting or not
> > setting
> > > a
> > > > > max
> > > > > > > > direct memory limit, I would also be interested why Yang Wang
> > > > thinks
> > > > > > > > leaving it open would be best. My concern about this would be
> > > that
> > > > we
> > > > > > > would
> > > > > > > > be in a similar situation as we are now with the
> > > > RocksDBStateBackend.
> > > > > > If
> > > > > > > > the different memory pools are not clearly separated and can
> > > spill
> > > > > over
> > > > > > > to
> > > > > > > > a different pool, then it is quite hard to understand what
> > > exactly
> > > > > > > causes a
> > > > > > > > process to get killed for using too much memory. This could
> > then
> > > > > easily
> > > > > > > > lead to a similar situation to what we have with the
> cutoff-ratio.
> > > So
> > > > > why
> > > > > > > not
> > > > > > > > set a sane default value for max direct memory and give
> > the
> > > > > user
> > > > > > an
> > > > > > > > option to increase it if he runs into an OOM.
> > > > > > > >
> > > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > > utilization
> > > > > than
> > > > > > > > alternative 3 where we set the direct memory to a higher
> value?
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > [hidden email]
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > >
> > > > > > > > > Regarding your comments:
> > > > > > > > >
> > > > > > > > > *Native and Direct Memory*
> > > > > > > > > I think setting a very large max direct memory size
> > definitely
> > > > has
> > > > > > some
> > > > > > > > > good sides. E.g., we do not worry about direct OOM, and we
> > > don't
> > > > > even
> > > > > > > > need
> > > > > > > > > to allocate managed / network memory with
> Unsafe.allocate() .
> > > > > > > > > However, there are also some down sides of doing this.
> > > > > > > > >
> > > > > > > > >    - One thing I can think of is that if a task executor
> > > > container
> > > > > is
> > > > > > > > >    killed due to overusing memory, it could be hard for us
> > to
> > > > know
> > > > > > > which
> > > > > > > > > part
> > > > > > > > >    of the memory is overused.
> > > > > > > > >    - Another downside is that the JVM never triggers GC due
> > to
> > > > > > reaching
> > > > > > > > max
> > > > > > > > >    direct memory limit, because the limit is too high to be
> > > > > reached.
> > > > > > > That
> > > > > > > > >    means we kind of rely on heap memory to trigger GC and
> > > > release
> > > > > > > direct
> > > > > > > > >    memory. That could be a problem in cases where we have
> > more
> > > > > direct
> > > > > > > > > memory
> > > > > > > > >    usage but not enough heap activity to trigger the GC.
> > > > > > > > >
> > > > > > > > > Maybe you can share your reasons for preferring setting a
> > very
> > > > > large
> > > > > > > > value,
> > > > > > > > > if there is anything else I overlooked.
> > > > > > > > >
> > > > > > > > > *Memory Calculation*
> > > > > > > > > If there is any conflict between multiple configuration
> that
> > > user
> > > > > > > > > explicitly specified, I think we should throw an error.
> > > > > > > > > I think doing checking on the client side is a good idea,
> so
> > > that
> > > > > on
> > > > > > > > Yarn /
> > > > > > > > > K8s we can discover the problem before submitting the Flink
> > > > > cluster,
> > > > > > > > which
> > > > > > > > > is always a good thing.
> > > > > > > > > But we can not only rely on the client side checking,
> because
> > > for
> > > > > > > > > standalone cluster TaskManagers on different machines may
> > have
> > > > > > > different
> > > > > > > > > > configurations and the client does not see that.
> > > > > > > > > What do you think?
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > [hidden email]>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi xintong,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks for your detailed proposal. After all the memory
> > > > > > > > > > configuration options are introduced, we will have more
> > > > > > > > > > powerful control over Flink's memory usage. I just have a
> > > > > > > > > > few questions about it.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > >
> > > > > > > > > > We do not differentiate user direct memory and native
> > memory.
> > > > > They
> > > > > > > are
> > > > > > > > > all
> > > > > > > > > > included in task off-heap memory. Right? So i don’t think
> > we
> > > > > could
> > > > > > > not
> > > > > > > > > set
> > > > > > > > > > the -XX:MaxDirectMemorySize properly. I prefer leaving
> it a
> > > > very
> > > > > > > large
> > > > > > > > > > value.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >    - Memory Calculation
> > > > > > > > > >
> > > > > > > > > > If the sum of fine-grained memory (network memory,
> > managed
> > > > > > memory,
> > > > > > > > > etc.)
> > > > > > > > > > is larger than total process memory, how do we deal with
> > this
> > > > > > > > situation?
> > > > > > > > > Do
> > > > > > > > > > we need to check the memory configuration in client?
> > > > > > > > > >
>
>
> On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> wrote:
>
> > Thanks for sharing your opinion Till.
> >
> > I'm also in favor of alternative 2. I was wondering whether we can avoid
> > using Unsafe.allocate() for off-heap managed memory and network memory
> with
> > alternative 3. But after giving it a second thought, I think even for
> > alternative 3 using direct memory for off-heap managed memory could cause
> > problems.
> >
> > Hi Yang,
> >
> > Regarding your concern, I think what is proposed in this FLIP is to have
> > both off-heap managed memory and network memory allocated through
> > Unsafe.allocate(), which means they are practically native memory and not
> > limited by the JVM max direct memory. The only parts of memory limited by
> > the JVM max direct memory are task off-heap memory and JVM overhead, which
> > is exactly what alternative 2 suggests setting the JVM max direct memory
> > to.
> >
> > Thank you~
> >
> > Xintong Song
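The distinction Xintong draws here between JVM-tracked direct memory and untracked native memory can be sketched in plain Java. This is illustrative only, not Flink code; the reflection-based Unsafe access mirrors what libraries such as Netty do, and all class and method names here are invented for the sketch:

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;

import sun.misc.Unsafe;

// Sketch: ByteBuffer.allocateDirect is accounted against
// -XX:MaxDirectMemorySize, while Unsafe-based allocation is plain native
// memory the JVM does not track against that limit and that must be freed
// manually.
public class NativeVsDirect {

    // Direct buffer allocation: tracked against the JVM max direct memory.
    static int directCapacity(int bytes) {
        return ByteBuffer.allocateDirect(bytes).capacity();
    }

    // Unsafe allocation: untracked native memory; the caller must free it.
    static boolean nativeRoundTrip(long bytes) {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);
            long address = unsafe.allocateMemory(bytes);
            boolean ok = address != 0;
            unsafe.freeMemory(address); // never tracked, never GC'ed
            return ok;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(directCapacity(16 * 1024 * 1024)); // 16777216
        System.out.println(nativeRoundTrip(16 * 1024 * 1024));
    }
}
```

Because the Unsafe allocation bypasses the direct memory accounting, pools allocated this way (off-heap managed and network memory, per the proposal) do not consume the `-XX:MaxDirectMemorySize` budget.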
> >
> >
> >
> > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > Thanks for the clarification Xintong. I understand the two alternatives
> > > now.
> > >
> > > I would be in favour of option 2 because it makes things explicit. If
> we
> > > don't limit the direct memory, I fear that we might end up in a similar
> > > situation as we are currently in: The user might see that her process
> > gets
> > > killed by the OS and does not know why this is the case. Consequently,
> > she
> > > tries to decrease the process memory size (similar to increasing the
> > cutoff
> > > ratio) in order to accommodate for the extra direct memory. Even worse,
> > she
> > > tries to decrease memory budgets which are not fully used and hence
> won't
> > > change the overall memory consumption.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > > > Let me explain this with a concrete example Till.
> > > >
> > > > Let's say we have the following scenario.
> > > >
> > > > Total Process Memory: 1GB
> > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory
> > and
> > > > Network Memory): 800MB
> > > >
> > > >
> > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > value,
> > > > let's say 1TB.
> > > >
> > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > Overhead
> > > > do not exceed 200MB, then alternative 2 and alternative 3 should have
> > the
> > > > same utility. Setting larger -XX:MaxDirectMemorySize will not reduce
> > the
> > > > sizes of the other memory pools.
> > > >
> > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > Overhead potentially exceed 200MB, then
> > > >
> > > >    - Alternative 2 suffers from frequent OOM. To avoid that, the only
> > > thing
> > > >    the user can do is to modify the configuration and increase JVM Direct
> > > > Memory
> > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user
> increases
> > > JVM
> > > >    Direct Memory to 250MB, this will reduce the total size of other
> > > memory
> > > >    pools to 750MB, given the total process memory remains 1GB.
> > > >    - For alternative 3, there is no chance of direct OOM. There are
> > > chances
> > > >    of exceeding the total process memory limit, but given that the
> > > process
> > > > may
> > > >    not use up all the reserved native memory (Off-Heap Managed
> Memory,
> > > > Network
> > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > slightly
> > > > above
> > > >    yet very close to 200MB, users probably do not need to change the
> > > >    configurations.
> > > >
> > > > Therefore, I think from the user's perspective, a feasible
> > configuration
> > > > for alternative 2 may lead to lower resource utilization compared to
> > > > alternative 3.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
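The arithmetic in Xintong's example can be made concrete with a small sketch (plain Java, not Flink code; the names are invented for illustration). The point is that the other pools are sized from the configured budget in both alternatives, so a larger `-XX:MaxDirectMemorySize` in alternative 3 does not by itself shrink them:

```java
// Sketch of the budget arithmetic from the example above (names invented).
public class DirectMemoryBudget {
    static final long MB = 1L << 20;

    // Memory left for heap, metaspace, off-heap managed and network pools.
    static long otherPools(long totalProcess, long directBudget) {
        return totalProcess - directBudget;
    }

    public static void main(String[] args) {
        long totalProcess = 1000 * MB; // total process memory: 1 GB
        long directBudget = 200 * MB;  // task off-heap memory + JVM overhead

        long alt2MaxDirect = directBudget;      // alternative 2: exact limit
        long alt3MaxDirect = 1024L * 1024 * MB; // alternative 3: "very large" (1 TB)

        // In both alternatives the remaining pools get the same 800 MB;
        // only the enforcement of the direct memory limit differs.
        System.out.println(otherPools(totalProcess, directBudget) / MB); // 800
    }
}
```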
> > > >
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <[hidden email]
> >
> > > > wrote:
> > > >
> > > > > I guess you have to help me understand the difference between
> > > > alternative 2
> > > > > and 3 wrt to memory under utilization Xintong.
> > > > >
> > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap Memory
> > and
> > > > JVM
> > > > > Overhead. Then there is the risk that this size is too low
> resulting
> > > in a
> > > > > lot of garbage collection and potentially an OOM.
> > > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger
> than
> > > > > alternative 2. This would of course reduce the sizes of the other
> > > memory
> > > > > types.
> > > > >
> > > > > How would alternative 2 now result in an under utilization of
> memory
> > > > > compared to alternative 3? If alternative 3 strictly sets a higher
> > max
> > > > > direct memory size and we use only little, then I would expect that
> > > > > alternative 3 results in memory under utilization.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Stephan Ewen
My main concern with option 2 (manually release memory) is that segfaults
in the JVM send off all sorts of alarms on user ends. So we need to
guarantee that this never happens.

The trickiness is in tasks that use data structures / algorithms with
additional threads, like hash table spill/read and sorting threads. We need
to ensure that these cleanly shut down before we can exit the task.
I am not sure that we have that guaranteed already, that's why option 1.1
seemed simpler to me.
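The shutdown ordering Stephan is worried about can be sketched as follows (invented names, not Flink code): the task's main thread must wait for auxiliary threads to exit before any manually allocated segment may be freed, otherwise a late writer can hit freed memory and segfault.

```java
import java.util.List;

// Sketch of the shutdown ordering required for option 2 (manual release):
// auxiliary threads (spilling, sorting) must have exited before the main
// thread frees any manually allocated memory segment. Names are invented.
public class SafeCleanupSketch {
    static volatile boolean running = true;

    // Returns the number of workers confirmed to have shut down.
    static int shutDownBeforeRelease(List<Thread> workers) {
        running = false;              // 1. signal workers to stop touching segments
        for (Thread t : workers) {
            try {
                t.join();             // 2. wait until no thread can still segfault
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return workers.size();        // 3. only now may segments be freed
    }

    public static void main(String[] args) {
        List<Thread> workers = new java.util.ArrayList<>();
        for (int i = 0; i < 2; i++) {
            Thread t = new Thread(() -> { while (running) Thread.yield(); });
            t.start();
            workers.add(t);
        }
        System.out.println(shutDownBeforeRelease(workers)); // 2
    }
}
```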

On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <[hidden email]> wrote:

> Thanks for the comments, Stephan. Summarized in this way really makes
> things easier to understand.
>
> I'm in favor of option 2, at least for the moment. I think it is not that
> difficult to keep the memory manager segfault safe, as long as we always
> de-allocate the memory segment when it is released from the memory
> consumers. A segfault can then only happen if a memory consumer continues
> using the buffer of a memory segment after releasing it, in which case we
> do want the job to fail so that we detect the memory leak early.
>
> For option 1.2, I don't think this is a good idea. Not only may the
> assumption (regular GC is enough to clean direct buffers) not always be
> true, it also makes it harder to find problems in cases of memory overuse.
> E.g., say the user configured some direct memory for the user libraries.
> If a library actually uses more direct memory than configured, which
> cannot be cleaned by GC because it is still in use, this may lead to
> overuse of the total container memory. In that case, if it didn't touch
> the JVM default max direct memory limit, we cannot get a direct memory
> OOM and it becomes super hard to understand which part of the
> configuration needs to be updated.
>
> Option 1.1 has a similar problem to 1.2 if the excess direct
> memory does not reach the max direct memory limit specified by the
> dedicated parameter. I think it is slightly better than 1.2, only because
> we can tune the parameter.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <[hidden email]> wrote:
>
> > About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize
> it a
> > bit differently:
> >
> > We have the following two options:
> >
> > (1) We let MemorySegments be de-allocated by the GC. That makes it
> segfault
> > safe. But then we need a way to trigger GC in case de-allocation and
> > re-allocation of a bunch of segments happens quickly, which is often the
> > case during batch scheduling or task restart.
> >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> >   - Another way could be to have a dedicated bookkeeping in the
> > MemoryManager (option 1.2), so that this is a number independent of the
> > "-XX:MaxDirectMemorySize" parameter.
> >
> > (2) We manually allocate and de-allocate the memory for the
> MemorySegments
> > (option 2). That way we need not worry about triggering GC by some
> > threshold or bookkeeping, but it is harder to prevent segfaults. We need
> to
> > be very careful about when we release the memory segments (only in the
> > cleanup phase of the main thread).
> >
> > If we go with option 1.1, we probably need to set
> > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory"
> and
> > have "direct_memory" as a separate reserved memory pool. Because if we
> just
> > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> jvm_overhead",
> > then there will be times when that entire memory is allocated by direct
> > buffers and we have nothing left for the JVM overhead. So we either need
> a
> > way to compensate for that (again some safety margin cutoff value) or we
> > will exceed container memory.
> >
> > If we go with option 1.2, we need to be aware that it takes elaborate
> logic
> > to push recycling of direct buffers without always triggering a full GC.
> >
> >
> > My first guess is that the options will be easiest to do in the following
> > order:
> >
> >   - Option 1.1 with a dedicated direct_memory parameter, as discussed
> > above. We would need to find a way to set the direct_memory parameter by
> > default. We could start with 64 MB and see how it goes in practice. One
> > danger I see is that setting this too low can cause a bunch of additional
> > GCs compared to before (we need to watch this carefully).
> >
> >   - Option 2. It is actually quite simple to implement, we could try how
> > segfault safe we are at the moment.
> >
> >   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> parameter
> > at all and assume that all the direct memory allocations that the JVM and
> > Netty do are infrequent enough to be cleaned up fast enough through
> regular
> > GC. I am not sure if that is a valid assumption, though.
> >
> > Best,
> > Stephan
> >
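[Editor's note] The trade-off Stephan describes between the two options can be sketched as below. This is an illustrative sketch only, not Flink code; the class name and sizes are invented. A direct `ByteBuffer` counts against `-XX:MaxDirectMemorySize` and its native memory is released only when the GC collects the wrapper object (option 1), while `Unsafe.allocateMemory` returns raw native memory outside that limit, which must be freed explicitly and can segfault the JVM on use-after-free (option 2).

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;

public class AllocationSketch {
    public static void main(String[] args) throws Exception {
        // Option 1 (GC-managed): counts against -XX:MaxDirectMemorySize;
        // the native memory is released only when the GC collects 'gcManaged'.
        ByteBuffer gcManaged = ByteBuffer.allocateDirect(1024);
        gcManaged.putLong(0, 7L);

        // Option 2 (manual): raw native memory, invisible to the direct
        // memory limit; must be freed explicitly, and use-after-free can
        // crash the JVM -- hence the care needed around when to release.
        Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        sun.misc.Unsafe unsafe = (sun.misc.Unsafe) f.get(null);
        long address = unsafe.allocateMemory(1024);
        unsafe.putLong(address, 42L);
        System.out.println(gcManaged.getLong(0) + " " + unsafe.getLong(address));
        unsafe.freeMemory(address); // manual release, no GC involved
    }
}
```

Obtaining `sun.misc.Unsafe` via reflection is assumed to be permitted here, as it is on standard JDKs.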
> >
> >
> > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> > wrote:
> >
> > > Thanks for sharing your opinion Till.
> > >
> > > I'm also in favor of alternative 2. I was wondering whether we can
> avoid
> > > using Unsafe.allocate() for off-heap managed memory and network memory
> > with
> > > alternative 3. But after giving it a second thought, I think even for
> > > alternative 3 using direct memory for off-heap managed memory could
> cause
> > > problems.
> > >
> > > Hi Yang,
> > >
> > > Regarding your concern, I think what is proposed in this FLIP is to have
> > > both off-heap managed memory and network memory allocated through
> > > Unsafe.allocate(), which means they are practically native memory and not
> > > limited by the JVM max direct memory. The only parts of memory limited by
> > > the JVM max direct memory are task off-heap memory and JVM overhead, which
> > > is exactly what alternative 2 suggests setting the JVM max direct memory to.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for the clarification Xintong. I understand the two
> alternatives
> > > > now.
> > > >
> > > > I would be in favour of option 2 because it makes things explicit. If
> > we
> > > > don't limit the direct memory, I fear that we might end up in a
> similar
> > > > situation as we are currently in: The user might see that her process
> > > gets
> > > > killed by the OS and does not know why this is the case.
> Consequently,
> > > she
> > > > tries to decrease the process memory size (similar to increasing the
> > > cutoff
> > > > ratio) in order to accommodate for the extra direct memory. Even
> worse,
> > > she
> > > > tries to decrease memory budgets which are not fully used and hence
> > won't
> > > > change the overall memory consumption.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <[hidden email]
> >
> > > > wrote:
> > > >
> > > > > Let me explain this with a concrete example Till.
> > > > >
> > > > > Let's say we have the following scenario.
> > > > >
> > > > > Total Process Memory: 1GB
> > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> Memory
> > > and
> > > > > Network Memory): 800MB
> > > > >
> > > > >
> > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > > value,
> > > > > let's say 1TB.
> > > > >
> > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > Overhead
> > > > > do not exceed 200MB, then alternative 2 and alternative 3 should
> have
> > > the
> > > > > same utility. Setting larger -XX:MaxDirectMemorySize will not
> reduce
> > > the
> > > > > sizes of the other memory pools.
> > > > >
> > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > Overhead potentially exceed 200MB, then
> > > > >
> > > > >    - Alternative 2 suffers from frequent OOM. To avoid that, the only
> > > > >    thing the user can do is to modify the configuration and increase JVM
> > > > >    Direct Memory (Task Off-Heap Memory + JVM Overhead). Let's say the
> > > > >    user increases JVM Direct Memory to 250MB; this reduces the total
> > > > >    size of the other memory pools to 750MB, given the total process
> > > > >    memory remains 1GB.
> > > > >    - For alternative 3, there is no chance of direct OOM. There are
> > > > >    chances of exceeding the total process memory limit, but given that
> > > > >    the process may not use up all the reserved native memory (Off-Heap
> > > > >    Managed Memory, Network Memory, JVM Metaspace), if the actual direct
> > > > >    memory usage is slightly above yet very close to 200MB, users
> > > > >    probably do not need to change the configuration.
> > > > >
> > > > > Therefore, I think from the user's perspective, a feasible
> > > configuration
> > > > > for alternative 2 may lead to lower resource utilization compared
> to
> > > > > alternative 3.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
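[Editor's note] Xintong's arithmetic above can be checked mechanically. The sketch below is illustrative only (the method name is invented and the pool split is the example's, not a Flink default): under a fixed total process budget, raising the direct memory budget shrinks what remains for the other pools.

```java
public class BudgetSketch {
    // Memory left for the other pools (heap, metaspace, managed, network)
    // once the JVM direct memory budget is reserved from the total.
    static int otherPoolsMb(int totalProcessMb, int jvmDirectMb) {
        return totalProcessMb - jvmDirectMb;
    }

    public static void main(String[] args) {
        int total = 1000; // 1 GB total process memory, as in the example
        // Alternative 2 with the original budget: 200 MB direct.
        System.out.println(otherPoolsMb(total, 200)); // 800 MB for other pools
        // User raises direct memory to 250 MB to avoid OOM: other pools shrink.
        System.out.println(otherPoolsMb(total, 250)); // 750 MB for other pools
    }
}
```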
> > > > >
> > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> [hidden email]
> > >
> > > > > wrote:
> > > > >
> > > > > > I guess you have to help me understand the difference between
> > > > > > alternative 2 and 3 with regard to memory under-utilization, Xintong.
> > > > > >
> > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap
> Memory
> > > and
> > > > > JVM
> > > > > > Overhead. Then there is the risk that this size is too low
> > resulting
> > > > in a
> > > > > > lot of garbage collection and potentially an OOM.
> > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger
> > than
> > > > > > alternative 2. This would of course reduce the sizes of the other
> > > > memory
> > > > > > types.
> > > > > >
> > > > > > How would alternative 2 now result in an under utilization of
> > memory
> > > > > > compared to alternative 3? If alternative 3 strictly sets a
> higher
> > > max
> > > > > > direct memory size and we use only little, then I would expect
> that
> > > > > > alternative 3 results in memory under utilization.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <[hidden email]
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Xintong, Till,
> > > > > > >
> > > > > > >
> > > > > > > > Native and Direct Memory
> > > > > > >
> > > > > > > My point is to set a very large max direct memory size when we do
> > > > > > > not differentiate direct and native memory. If the direct memory,
> > > > > > > including user direct memory and framework direct memory, could be
> > > > > > > calculated correctly, then I am in favor of setting the direct
> > > > > > > memory to a fixed value.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Memory Calculation
> > > > > > >
> > > > > > > I agree with Xintong. For Yarn and K8s, we need to check the memory
> > > > > > > configurations on the client side to avoid a submission that
> > > > > > > succeeds but then fails in the Flink master.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Yang
> > > > > > >
> > > > > > > Xintong Song <[hidden email]>于2019年8月13日 周二22:07写道:
> > > > > > >
> > > > > > > > Thanks for replying, Till.
> > > > > > > >
> > > > > > > > About MemorySegment, I think you are right that we should not
> > > > include
> > > > > > > this
> > > > > > > > issue in the scope of this FLIP. This FLIP should concentrate
> > on
> > > > how
> > > > > to
> > > > > > > > configure memory pools for TaskExecutors, with minimum
> > > involvement
> > > > on
> > > > > > how
> > > > > > > > memory consumers use it.
> > > > > > > >
> > > > > > > > About direct memory, I think alternative 3 may not have the same
> > > > > > > > over-reservation issue that alternative 2 does, but at the cost of
> > > > > > > > the risk of overusing memory at the container level, which is not
> > > > > > > > good. My point is that both "Task Off-Heap Memory" and "JVM
> > > > > > > > Overhead" are not easy to configure. For alternative 2, users
> > > > > > > > might configure them higher than what is actually needed, just to
> > > > > > > > avoid getting a direct OOM. For alternative 3, users do not get a
> > > > > > > > direct OOM, so they may not configure the two options aggressively
> > > > > > > > high. But the consequence is the risk that overall container
> > > > > > > > memory usage exceeds the budget.
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > >
> > > > > > > > > All in all I think it already looks quite good. Concerning
> > the
> > > > > first
> > > > > > > open
> > > > > > > > > question about allocating memory segments, I was wondering
> > > > whether
> > > > > > this
> > > > > > > > is
> > > > > > > > > strictly necessary to do in the context of this FLIP or
> > whether
> > > > > this
> > > > > > > > could
> > > > > > > > > be done as a follow up? Without knowing all details, I
> would
> > be
> > > > > > > concerned
> > > > > > > > > that we would widen the scope of this FLIP too much because
> > we
> > > > > would
> > > > > > > have
> > > > > > > > > to touch all the existing call sites of the MemoryManager
> > where
> > > > we
> > > > > > > > allocate
> > > > > > > > > memory segments (this should mainly be batch operators).
> The
> > > > > addition
> > > > > > > of
> > > > > > > > > the memory reservation call to the MemoryManager should not
> > be
> > > > > > affected
> > > > > > > > by
> > > > > > > > > this and I would hope that this is the only point of
> > > interaction
> > > > a
> > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > >
> > > > > > > > > Concerning the second open question about setting or not
> > > setting
> > > > a
> > > > > > max
> > > > > > > > > direct memory limit, I would also be interested why Yang
> Wang
> > > > > thinks
> > > > > > > > > leaving it open would be best. My concern about this would
> be
> > > > that
> > > > > we
> > > > > > > > would
> > > > > > > > > be in a similar situation as we are now with the
> > > > > RocksDBStateBackend.
> > > > > > > If
> > > > > > > > > the different memory pools are not clearly separated and
> can
> > > > spill
> > > > > > over
> > > > > > > > to
> > > > > > > > > a different pool, then it is quite hard to understand what
> > > > exactly
> > > > > > > > causes a
> > > > > > > > > process to get killed for using too much memory. This could
> > > then
> > > > > > easily
> > > > > > > > > lead to a similar situation what we have with the
> > cutoff-ratio.
> > > > So
> > > > > > why
> > > > > > > > not
> > > > > > > > > setting a sane default value for max direct memory and
> giving
> > > the
> > > > > > user
> > > > > > > an
> > > > > > > > > option to increase it if he runs into an OOM.
> > > > > > > > >
> > > > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > > > utilization
> > > > > > than
> > > > > > > > > alternative 3 where we set the direct memory to a higher
> > value?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > >
> > > > > > > > > > Regarding your comments:
> > > > > > > > > >
> > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > I think setting a very large max direct memory size
> > > definitely
> > > > > has
> > > > > > > some
> > > > > > > > > > good sides. E.g., we do not worry about direct OOM, and
> we
> > > > don't
> > > > > > even
> > > > > > > > > need
> > > > > > > > > > to allocate managed / network memory with
> > Unsafe.allocate() .
> > > > > > > > > > However, there are also some down sides of doing this.
> > > > > > > > > >
> > > > > > > > > >    - One thing I can think of is that if a task executor
> > > > > > > > > >    container is killed due to overusing memory, it could be
> > > > > > > > > >    hard for us to know which part of the memory is overused.
> > > > > > > > > >    - Another downside is that the JVM never triggers GC due to
> > > > > > > > > >    reaching the max direct memory limit, because the limit is
> > > > > > > > > >    too high to be reached. That means we kind of rely on heap
> > > > > > > > > >    memory to trigger GC and release direct memory. That could
> > > > > > > > > >    be a problem in cases where we have more direct memory
> > > > > > > > > >    usage but not enough heap activity to trigger the GC.
> > > > > > > > > >
> > > > > > > > > > Maybe you can share your reasons for preferring a very large
> > > > > > > > > > value, in case there is anything else I overlooked.
> > > > > > > > > >
> > > > > > > > > > *Memory Calculation*
> > > > > > > > > > If there is any conflict between multiple configuration
> > that
> > > > user
> > > > > > > > > > explicitly specified, I think we should throw an error.
> > > > > > > > > > I think doing checking on the client side is a good idea,
> > so
> > > > that
> > > > > > on
> > > > > > > > > Yarn /
> > > > > > > > > > K8s we can discover the problem before submitting the
> Flink
> > > > > > cluster,
> > > > > > > > > which
> > > > > > > > > > is always a good thing.
> > > > > > > > > > But we cannot rely only on the client-side checking, because
> > > > > > > > > > for standalone clusters, TaskManagers on different machines
> > > > > > > > > > may have different configurations, and the client does not
> > > > > > > > > > see them.
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
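[Editor's note] The client-side check Xintong and Yang discuss could look roughly like the following. This is a hypothetical sketch (the method and pool names are invented, not the FLIP's actual option names): sum the explicitly configured fine-grained pools and fail fast if they exceed the total process memory.

```java
public class ConfigCheckSketch {
    // Hypothetical validation; all sizes in MB.
    static void validate(long totalProcess, long heap, long managed,
                         long network, long metaspace, long overhead) {
        long sum = heap + managed + network + metaspace + overhead;
        if (sum > totalProcess) {
            throw new IllegalArgumentException("configured pools sum to "
                    + sum + " MB, exceeding total process memory of "
                    + totalProcess + " MB");
        }
    }

    public static void main(String[] args) {
        validate(1000, 400, 300, 100, 96, 104); // sums to exactly 1000: accepted
        try {
            validate(1000, 500, 300, 100, 96, 104); // sums to 1100: rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

As the thread notes, for standalone setups the same check would also have to run on each TaskManager, since the client cannot see per-machine configurations.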
> > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi xintong,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your detailed proposal. After all the memory
> > > > > > > configuration
> > > > > > > > > are
> > > > > > > > > > > introduced, it will be more powerful to control the
> flink
> > > > > memory
> > > > > > > > > usage. I
> > > > > > > > > > > just have few questions about it.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > >
> > > > > > > > > > > We do not differentiate user direct memory and native
> > > > > > > > > > > memory. They are all included in task off-heap memory.
> > > > > > > > > > > Right? So I don't think we could set the
> > > > > > > > > > > -XX:MaxDirectMemorySize properly. I prefer leaving it at a
> > > > > > > > > > > very large value.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > >
> > > > > > > > > > > If the sum of the fine-grained memory (network memory,
> > > > > > > > > > > managed memory, etc.) is larger than the total process
> > > > > > > > > > > memory, how do we deal with this situation? Do we need to
> > > > > > > > > > > check the memory configuration on the client?
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song <[hidden email]> 于2019年8月7日周三
> > > 下午10:14写道:
> > > > > > > > > > >
> > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > >
> > > > > > > > > > > > We would like to start a discussion thread on
> "FLIP-49:
> > > > > Unified
> > > > > > > > > Memory
> > > > > > > > > > > > Configuration for TaskExecutors"[1], where we
> describe
> > > how
> > > > to
> > > > > > > > improve
> > > > > > > > > > > > TaskExecutor memory configurations. The FLIP document
> > is
> > > > > mostly
> > > > > > > > based
> > > > > > > > > > on
> > > > > > > > > > > an
> > > > > > > > > > > > early design "Memory Management and Configuration
> > > > > Reloaded"[2]
> > > > > > by
> > > > > > > > > > > Stephan,
> > > > > > > > > > > > with updates from follow-up discussions both online
> and
> > > > > > offline.
> > > > > > > > > > > >
> > > > > > > > > > > > This FLIP addresses several shortcomings of current
> > > (Flink
> > > > > 1.9)
> > > > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > > > >
> > > > > > > > > > > >    - Different configuration for Streaming and Batch.
> > > > > > > > > > > >    - Complex and difficult configuration of RocksDB
> in
> > > > > > Streaming.
> > > > > > > > > > > >    - Complicated, uncertain and hard to understand.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Key changes to solve the problems can be summarized
> as
> > > > > follows.
> > > > > > > > > > > >
> > > > > > > > > > > >    - Extend memory manager to also account for memory
> > > usage
> > > > > by
> > > > > > > > state
> > > > > > > > > > > >    backends.
> > > > > > > > > > > >    - Modify how TaskExecutor memory is partitioned into
> > > > > > > > > > > >    accounted individual memory reservations and pools.
> > > > > > > > > > > >    - Simplify memory configuration options and
> > > calculations
> > > > > > > logics.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Please find more details in the FLIP wiki document
> [1].
> > > > > > > > > > > >
> > > > > > > > > > > > (Please note that the early design doc [2] is out of
> > > sync,
> > > > > and
> > > > > > it
> > > > > > > > is
> > > > > > > > > > > > appreciated to have the discussion in this mailing
> list
> > > > > > thread.)
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you~
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > > > >
> > > > > > > > > > > > [2]
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your detailed proposal. After all the memory
> > > > > > > configuration
> > > > > > > > > are
> > > > > > > > > > > introduced, it will be more powerful to control the
> flink
> > > > > memory
> > > > > > > > > usage. I
> > > > > > > > > > > just have few questions about it.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > >
> > > > > > > > > > > We do not differentiate user direct memory and native
> > > memory.
> > > > > > They
> > > > > > > > are
> > > > > > > > > > all
> > > > > > > > > > > included in task off-heap memory. Right? So i don’t
> think
> > > we
> > > > > > could
> > > > > > > > not
> > > > > > > > > > set
> > > > > > > > > > > the -XX:MaxDirectMemorySize properly. I prefer leaving
> > it a
> > > > > very
> > > > > > > > large
> > > > > > > > > > > value.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > >
> > > > > > > > > > > If the sum of and fine-grained memory(network memory,
> > > managed
> > > > > > > memory,
> > > > > > > > > > etc.)
> > > > > > > > > > > is larger than total process memory, how do we deal
> with
> > > this
> > > > > > > > > situation?
> > > > > > > > > > Do
> > > > > > > > > > > we need to check the memory configuration in client?
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song <[hidden email]> 于2019年8月7日周三
> > > 下午10:14写道:
> > > > > > > > > > >
> > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > >
> > > > > > > > > > > > We would like to start a discussion thread on
> "FLIP-49:
> > > > > Unified
> > > > > > > > > Memory
> > > > > > > > > > > > Configuration for TaskExecutors"[1], where we
> describe
> > > how
> > > > to
> > > > > > > > improve
> > > > > > > > > > > > TaskExecutor memory configurations. The FLIP document
> > is
> > > > > mostly
> > > > > > > > based
> > > > > > > > > > on
> > > > > > > > > > > an
> > > > > > > > > > > > early design "Memory Management and Configuration
> > > > > Reloaded"[2]
> > > > > > by
> > > > > > > > > > > Stephan,
> > > > > > > > > > > > with updates from follow-up discussions both online
> and
> > > > > > offline.
> > > > > > > > > > > >
> > > > > > > > > > > > This FLIP addresses several shortcomings of current
> > > (Flink
> > > > > 1.9)
> > > > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > > > >
> > > > > > > > > > > >    - Different configuration for Streaming and Batch.
> > > > > > > > > > > >    - Complex and difficult configuration of RocksDB
> in
> > > > > > Streaming.
> > > > > > > > > > > >    - Complicated, uncertain and hard to understand.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Key changes to solve the problems can be summarized
> as
> > > > > follows.
> > > > > > > > > > > >
> > > > > > > > > > > >    - Extend memory manager to also account for memory
> > > usage
> > > > > by
> > > > > > > > state
> > > > > > > > > > > >    backends.
> > > > > > > > > > > >    - Modify how TaskExecutor memory is partitioned
> > > > accounted
> > > > > > > > > individual
> > > > > > > > > > > >    memory reservations and pools.
> > > > > > > > > > > >    - Simplify memory configuration options and
> > > calculations
> > > > > > > logics.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Please find more details in the FLIP wiki document
> [1].
> > > > > > > > > > > >
> > > > > > > > > > > > (Please note that the early design doc [2] is out of
> > > sync,
> > > > > and
> > > > > > it
> > > > > > > > is
> > > > > > > > > > > > appreciated to have the discussion in this mailing
> list
> > > > > > thread.)
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Looking forward to your feedbacks.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you~
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > > > >
> > > > > > > > > > > > [2]
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

JingsongLee-2
Hi Stephan,

About option 2:

If additional threads are not cleanly shut down before the task exits:
with the current memory reuse, the exiting task has already freed the
memory it uses. If that memory is handed to other tasks while asynchronous
threads of the exited task are still writing to it, there will be
concurrency safety problems, which can even corrupt user computation
results.

So I think this is a serious and intolerable bug; whichever option we
choose, it must be avoided.

About direct memory cleaned by GC:
I don't think relying on GC is a good idea; I have encountered many
situations where GC came too late and a DirectMemory OOM occurred.
Releasing and allocating DirectMemory depends on the type of the user
job, which is often beyond our control.

Best,
Jingsong Lee
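Jingsong's use-after-release concern can be illustrated with a toy sketch (Python, purely hypothetical — this is not Flink's actual MemoryManager): under option 2, a segment that fails fast on access after release turns silent corruption into a detectable error.

```python
class MemorySegment:
    """Toy model of a manually managed memory segment (option 2 sketch)."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.freed = False

    def put(self, offset, value):
        # Fail fast instead of silently writing into memory that may
        # already have been handed to another task.
        if self.freed:
            raise RuntimeError("segment used after release")
        self.buf[offset] = value

    def release(self):
        self.freed = True
        self.buf = None


seg = MemorySegment(16)
seg.put(0, 42)
seg.release()

use_after_release_detected = False
try:
    seg.put(0, 7)  # e.g. an async thread of the exited task writing late
except RuntimeError:
    use_after_release_detected = True
```

With this kind of guard the job fails visibly, so the memory leak is detected early instead of corrupting another task's data.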


------------------------------------------------------------------
From:Stephan Ewen <[hidden email]>
Send Time: Mon, Aug 19, 2019, 15:56
To:dev <[hidden email]>
Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

My main concern with option 2 (manually release memory) is that segfaults
in the JVM send off all sorts of alarms on user ends. So we need to
guarantee that this never happens.

The trickiness is in tasks that use data structures / algorithms with
additional threads, like hash table spill/read and sorting threads. We need
to ensure that these shut down cleanly before we can exit the task.
I am not sure that we have that guaranteed already, that's why option 1.1
seemed simpler to me.

On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <[hidden email]> wrote:

> Thanks for the comments, Stephan. Summarized in this way really makes
> things easier to understand.
>
> I'm in favor of option 2, at least for the moment. I think it is not that
> difficult to keep the memory manager segfault safe, as long as we always
> de-allocate the memory segment when it is released from the memory
> consumers. A segfault could only happen if a memory consumer continues
> using the buffer of a memory segment after releasing it, in which case we
> do want the job to fail so we detect the memory leak early.
>
> For option 1.2, I don't think this is a good idea. Not only because the
> assumption (regular GC is enough to clean direct buffers) may not always be
> true, but also because it makes it harder to find problems in cases of
> memory overuse. E.g., the user configured some direct memory for the user
> libraries. If a library actually uses more direct memory than configured,
> which cannot be cleaned by GC because it is still in use, this may lead to
> overuse of the total container memory. In that case, if it didn't touch the
> JVM default max direct memory limit, we cannot get a direct memory OOM and
> it will become super hard to understand which part of the configuration
> needs to be updated.
>
> For option 1.1, it has a similar problem to 1.2, if the exceeded direct
> memory does not reach the max direct memory limit specified by the
> dedicated parameter. I think it is slightly better than 1.2, only because
> we can tune the parameter.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <[hidden email]> wrote:
>
> > About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize
> it a
> > bit differently:
> >
> > We have the following two options:
> >
> > (1) We let MemorySegments be de-allocated by the GC. That makes it
> segfault
> > safe. But then we need a way to trigger GC in case de-allocation and
> > re-allocation of a bunch of segments happens quickly, which is often the
> > case during batch scheduling or task restart.
> >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> >   - Another way could be to have a dedicated bookkeeping in the
> > MemoryManager (option 1.2), so that this is a number independent of the
> > "-XX:MaxDirectMemorySize" parameter.
> >
> > (2) We manually allocate and de-allocate the memory for the
> MemorySegments
> > (option 2). That way we need not worry about triggering GC by some
> > threshold or bookkeeping, but it is harder to prevent segfaults. We need
> to
> > be very careful about when we release the memory segments (only in the
> > cleanup phase of the main thread).
> >
> > If we go with option 1.1, we probably need to set
> > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory"
> and
> > have "direct_memory" as a separate reserved memory pool. Because if we
> just
> > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> jvm_overhead",
> > then there will be times when that entire memory is allocated by direct
> > buffers and we have nothing left for the JVM overhead. So we either need
> a
> > way to compensate for that (again some safety margin cutoff value) or we
> > will exceed container memory.
> >
> > If we go with option 1.2, we need to be aware that it takes elaborate
> logic
> > to push recycling of direct buffers without always triggering a full GC.
> >
> >
> > My first guess is that the options will be easiest to do in the following
> > order:
> >
> >   - Option 1.1 with a dedicated direct_memory parameter, as discussed
> > above. We would need to find a way to set the direct_memory parameter by
> > default. We could start with 64 MB and see how it goes in practice. One
> > danger I see is that setting this too low can cause a bunch of additional
> > GCs compared to before (we need to watch this carefully).
> >
> >   - Option 2. It is actually quite simple to implement, we could try how
> > segfault safe we are at the moment.
> >
> >   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> parameter
> > at all and assume that all the direct memory allocations that the JVM and
> > Netty do are infrequent enough to be cleaned up fast enough through
> regular
> > GC. I am not sure if that is a valid assumption, though.
> >
> > Best,
> > Stephan
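To make the flag arithmetic of option 1.1 concrete, here is a minimal sketch (sizes in MB; the 300 MB managed-memory figure is an arbitrary illustration, and 64 MB is only the starting default suggested above, not a decided value):

```python
DEFAULT_RESERVED_DIRECT_MB = 64  # suggested starting point, to be tuned

def max_direct_memory_mb(off_heap_managed_mb,
                         reserved_direct_mb=DEFAULT_RESERVED_DIRECT_MB):
    # Option 1.1: off-heap managed memory is allocated as direct buffers,
    # so -XX:MaxDirectMemorySize must cover it plus a separately reserved
    # direct-memory pool -- not the JVM overhead, which could otherwise be
    # entirely consumed by direct buffers at peak allocation.
    return off_heap_managed_mb + reserved_direct_mb

# E.g., 300 MB of off-heap managed memory with the default reservation
# yields -XX:MaxDirectMemorySize=364m.
flag_mb = max_direct_memory_mb(300)
```

Setting the reservation too low risks extra GC pressure, as noted above, which is why the default would need to be watched carefully in practice.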
> >
> >
> >
> > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> > wrote:
> >
> > > Thanks for sharing your opinion Till.
> > >
> > > I'm also in favor of alternative 2. I was wondering whether we can
> avoid
> > > using Unsafe.allocate() for off-heap managed memory and network memory
> > with
> > > alternative 3. But after giving it a second thought, I think even for
> > > alternative 3 using direct memory for off-heap managed memory could
> cause
> > > problems.
> > >
> > > Hi Yang,
> > >
> > > Regarding your concern, I think what is proposed in this FLIP is to have
> > both
> > > off-heap managed memory and network memory allocated through
> > > Unsafe.allocate(), which means they are practically native memory and
> not
> > > limited by JVM max direct memory. The only parts of memory limited by
> JVM
> > > max direct memory are task off-heap memory and JVM overhead, which are
> > > exactly alternative 2 suggests to set the JVM max direct memory to.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for the clarification Xintong. I understand the two
> alternatives
> > > > now.
> > > >
> > > > I would be in favour of option 2 because it makes things explicit. If
> > we
> > > > don't limit the direct memory, I fear that we might end up in a
> similar
> > > > situation as we are currently in: The user might see that her process
> > > gets
> > > > killed by the OS and does not know why this is the case.
> Consequently,
> > > she
> > > > tries to decrease the process memory size (similar to increasing the
> > > cutoff
> > > > ratio) in order to accommodate for the extra direct memory. Even
> worse,
> > > she
> > > > tries to decrease memory budgets which are not fully used and hence
> > won't
> > > > change the overall memory consumption.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <[hidden email]
> >
> > > > wrote:
> > > >
> > > > > Let me explain this with a concrete example Till.
> > > > >
> > > > > Let's say we have the following scenario.
> > > > >
> > > > > Total Process Memory: 1GB
> > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> Memory
> > > and
> > > > > Network Memory): 800MB
> > > > >
> > > > >
> > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > > value,
> > > > > let's say 1TB.
> > > > >
> > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > Overhead
> > > > > do not exceed 200MB, then alternative 2 and alternative 3 should
> have
> > > the
> > > > > same utility. Setting larger -XX:MaxDirectMemorySize will not
> reduce
> > > the
> > > > > sizes of the other memory pools.
> > > > >
> > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > Overhead potentially exceed 200MB, then
> > > > >
> > > > >    - Alternative 2 suffers from frequent OOM. To avoid that, the
> only
> > > > thing
> > > > >    user can do is to modify the configuration and increase JVM
> Direct
> > > > > Memory
> > > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user
> > increases
> > > > JVM
> > > > >    Direct Memory to 250MB, this will reduce the total size of other
> > > > memory
> > > > >    pools to 750MB, given the total process memory remains 1GB.
> > > > >    - For alternative 3, there is no chance of direct OOM. There are
> > > > chances
> > > > >    of exceeding the total process memory limit, but given that the
> > > > process
> > > > > may
> > > > >    not use up all the reserved native memory (Off-Heap Managed
> > Memory,
> > > > > Network
> > > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > > slightly
> > > > > above
> > > > >    yet very close to 200MB, user probably do not need to change the
> > > > >    configurations.
> > > > >
> > > > > Therefore, I think from the user's perspective, a feasible
> > > configuration
> > > > > for alternative 2 may lead to lower resource utilization compared
> to
> > > > > alternative 3.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
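The scenario above, in numbers (a back-of-the-envelope sketch only, treating 1 GB as 1000 MB as the example does):

```python
total_process_mb = 1000
jvm_direct_mb = 200  # Task Off-Heap Memory + JVM Overhead
# Heap, metaspace, off-heap managed, and network memory share the rest.
other_mb = total_process_mb - jvm_direct_mb

# Alternative 2 sets -XX:MaxDirectMemorySize=200m (a hard limit);
# alternative 3 sets it to a huge value, e.g. 1 TB, so it is never hit.
# If actual direct usage can exceed 200 MB, alternative 2 forces the user
# to grow the direct budget, shrinking every other pool:
jvm_direct_grown_mb = 250
other_after_grow_mb = total_process_mb - jvm_direct_grown_mb
```

This is the arithmetic behind the utilization argument: under alternative 2 the other pools shrink from 800 MB to 750 MB, while under alternative 3 they stay at 800 MB as long as the container limit is not actually exceeded.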
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> [hidden email]
> > >
> > > > > wrote:
> > > > >
> > > > > > I guess you have to help me understand the difference between
> > > > > alternative 2
> > > > > > and 3 w.r.t. memory underutilization, Xintong.
> > > > > >
> > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap
> Memory
> > > and
> > > > > JVM
> > > > > > Overhead. Then there is the risk that this size is too low
> > resulting
> > > > in a
> > > > > > lot of garbage collection and potentially an OOM.
> > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger
> > than
> > > > > > alternative 2. This would of course reduce the sizes of the other
> > > > memory
> > > > > > types.
> > > > > >
> > > > > > How would alternative 2 now result in an under utilization of
> > memory
> > > > > > compared to alternative 3? If alternative 3 strictly sets a
> higher
> > > max
> > > > > > direct memory size and we use only little, then I would expect
> that
> > > > > > alternative 3 results in memory under utilization.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <[hidden email]
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Xintong, Till,
> > > > > > >
> > > > > > >
> > > > > > > > Native and Direct Memory
> > > > > > >
> > > > > > > My point is about setting a very large max direct memory size when
> > > > > > > we do not differentiate direct and native memory. If the direct
> > > > > > > memory, including user direct memory and framework direct memory,
> > > > > > > could be calculated correctly, then I am in favor of setting direct
> > > > > > > memory to a fixed value.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Memory Calculation
> > > > > > >
> > > > > > > I agree with Xintong. For Yarn and K8s, we need to check the memory
> > > > > > > configurations on the client to avoid submitting successfully and then
> > > > > > > failing in the Flink master.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Yang
> > > > > > >
> > > > > > > Xintong Song <[hidden email]> wrote on Tue, Aug 13, 2019, 22:07:
> > > > > > >
> > > > > > > > Thanks for replying, Till.
> > > > > > > >
> > > > > > > > About MemorySegment, I think you are right that we should not
> > > > include
> > > > > > > this
> > > > > > > > issue in the scope of this FLIP. This FLIP should concentrate
> > on
> > > > how
> > > > > to
> > > > > > > > configure memory pools for TaskExecutors, with minimum
> > > involvement
> > > > on
> > > > > > how
> > > > > > > > memory consumers use it.
> > > > > > > >
> > > > > > > > About direct memory, I think alternative 3 may not have the same
> > > > > > > > over-reservation issue that alternative 2 does, but at the cost of a
> > > > > > > > risk of overusing memory at the container level, which is not good.
> > > > > > > > My point is that both "Task Off-Heap Memory" and "JVM Overhead" are
> > > > > > > > not easy to configure. For alternative 2, users might configure them
> > > > > > > > higher than what is actually needed, just to avoid getting a direct
> > > > > > > > OOM. For alternative 3, users do not get a direct OOM, so they may
> > > > > > > > not configure the two options aggressively high. But the consequence
> > > > > > > > is the risk that the overall container memory usage exceeds the
> > > > > > > > budget.
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > >
> > > > > > > > > All in all I think it already looks quite good. Concerning
> > the
> > > > > first
> > > > > > > open
> > > > > > > > > question about allocating memory segments, I was wondering
> > > > whether
> > > > > > this
> > > > > > > > is
> > > > > > > > > strictly necessary to do in the context of this FLIP or
> > whether
> > > > > this
> > > > > > > > could
> > > > > > > > > be done as a follow up? Without knowing all details, I
> would
> > be
> > > > > > > concerned
> > > > > > > > > that we would widen the scope of this FLIP too much because
> > we
> > > > > would
> > > > > > > have
> > > > > > > > > to touch all the existing call sites of the MemoryManager
> > where
> > > > we
> > > > > > > > allocate
> > > > > > > > > memory segments (this should mainly be batch operators).
> The
> > > > > addition
> > > > > > > of
> > > > > > > > > the memory reservation call to the MemoryManager should not
> > be
> > > > > > affected
> > > > > > > > by
> > > > > > > > > this and I would hope that this is the only point of
> > > interaction
> > > > a
> > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > >
> > > > > > > > > Concerning the second open question about setting or not
> > > setting
> > > > a
> > > > > > max
> > > > > > > > > direct memory limit, I would also be interested why Yang
> Wang
> > > > > thinks
> > > > > > > > > leaving it open would be best. My concern about this would
> be
> > > > that
> > > > > we
> > > > > > > > would
> > > > > > > > > be in a similar situation as we are now with the
> > > > > RocksDBStateBackend.
> > > > > > > If
> > > > > > > > > the different memory pools are not clearly separated and
> can
> > > > spill
> > > > > > over
> > > > > > > > to
> > > > > > > > > a different pool, then it is quite hard to understand what
> > > > exactly
> > > > > > > > causes a
> > > > > > > > > process to get killed for using too much memory. This could
> > > then
> > > > > > easily
> > > > > > > > > lead to a similar situation to what we have with the
> > cutoff-ratio.
> > > > So
> > > > > > why
> > > > > > > > not
> > > > > > > > > setting a sane default value for max direct memory and
> giving
> > > the
> > > > > > user
> > > > > > > an
> > > > > > > > > option to increase it if he runs into an OOM.
> > > > > > > > >
> > > > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > > > utilization
> > > > > > than
> > > > > > > > > alternative 3 where we set the direct memory to a higher
> > value?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > >
> > > > > > > > > > Regarding your comments:
> > > > > > > > > >
> > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > I think setting a very large max direct memory size definitely
> > > > > > > > > > has some good sides. E.g., we do not need to worry about direct
> > > > > > > > > > OOM, and we don't even need to allocate managed / network memory
> > > > > > > > > > with Unsafe.allocate().
> > > > > > > > > > However, there are also some downsides of doing this.
> > > > > > > > > >
> > > > > > > > > >    - One thing I can think of is that if a task executor
> > > > > > > > > >    container is killed due to overusing memory, it could be hard
> > > > > > > > > >    for us to know which part of the memory is overused.
> > > > > > > > > >    - Another downside is that the JVM never triggers GC due to
> > > > > > > > > >    reaching the max direct memory limit, because the limit is too
> > > > > > > > > >    high to be reached. That means we kind of rely on heap memory
> > > > > > > > > >    to trigger GC and release direct memory. That could be a
> > > > > > > > > >    problem in cases where we have more direct memory usage but
> > > > > > > > > >    not enough heap activity to trigger the GC.
> > > > > > > > > >
> > > > > > > > > > Maybe you can share your reasons for preferring to set a very
> > > > > > > > > > large value, if there is anything else I overlooked.
> > > > > > > > > >
> > > > > > > > > > *Memory Calculation*
> > > > > > > > > > If there is any conflict between multiple configurations that the
> > > > > > > > > > user explicitly specified, I think we should throw an error.
> > > > > > > > > > I think doing the checking on the client side is a good idea, so
> > > > > > > > > > that on Yarn / K8s we can discover the problem before submitting
> > > > > > > > > > the Flink cluster, which is always a good thing.
> > > > > > > > > > But we cannot rely only on client-side checking, because for a
> > > > > > > > > > standalone cluster, TaskManagers on different machines may have
> > > > > > > > > > different configurations and the client does not see that.
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
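The client-side check discussed here can be sketched as follows (a hypothetical helper in Python, not an actual Flink API): sum the explicitly configured fine-grained pools and fail fast if they cannot fit in the total process memory.

```python
def validate_memory_config(total_process_mb, explicit_pools_mb):
    """Raise if explicitly configured pools exceed total process memory."""
    total_explicit = sum(explicit_pools_mb.values())
    if total_explicit > total_process_mb:
        raise ValueError(
            "explicitly configured pools (%d MB) exceed total process "
            "memory (%d MB)" % (total_explicit, total_process_mb))
    # Whatever remains is available for the implicitly derived pools.
    return total_process_mb - total_explicit


# Consistent configuration: 400 MB explicit, 600 MB left to derive.
leftover_mb = validate_memory_config(1000, {"network": 100, "managed": 300})

# Conflicting configuration: 600 MB explicit against a 500 MB budget.
conflict_detected = False
try:
    validate_memory_config(500, {"network": 300, "managed": 300})
except ValueError:
    conflict_detected = True
```

As noted above, the same check would still have to run on each TaskManager in standalone setups, since the client cannot see per-machine configurations.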
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi xintong,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your detailed proposal. After all the memory
> > > > > > > > > > > configurations are introduced, we will have more powerful
> > > > > > > > > > > control over the Flink memory usage. I just have a few questions
> > > > > > > > > > > about it.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > >
> > > > > > > > > > > We do not differentiate user direct memory and native memory.
> > > > > > > > > > > They are all included in task off-heap memory. Right? So I don’t
> > > > > > > > > > > think we could set the -XX:MaxDirectMemorySize properly. I
> > > > > > > > > > > prefer leaving it at a very large value.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > >
> > > > > > > > > > > If the sum of the fine-grained memory (network memory, managed
> > > > > > > > > > > memory, etc.) is larger than the total process memory, how do we
> > > > > > > > > > > deal with this situation? Do we need to check the memory
> > > > > > > > > > > configuration in the client?
> > > > > > > > > > >
> >
> >
> > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> > wrote:
> >
> > > Thanks for sharing your opinion Till.
> > >
> > > I'm also in favor of alternative 2. I was wondering whether we can
> avoid
> > > using Unsafe.allocate() for off-heap managed memory and network memory
> > with
> > > alternative 3. But after giving it a second thought, I think even for
> > > alternative 3 using direct memory for off-heap managed memory could
> cause
> > > problems.
> > >
> > > Hi Yang,
> > >
> > > Regarding your concern, I think what is proposed in this FLIP is to have
> > > both off-heap managed memory and network memory allocated through
> > > Unsafe.allocate(), which means they are practically native memory and not
> > > limited by JVM max direct memory. The only parts of memory limited by JVM
> > > max direct memory are task off-heap memory and JVM overhead, which is
> > > exactly what alternative 2 suggests to set the JVM max direct memory to.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for the clarification Xintong. I understand the two
> alternatives
> > > > now.
> > > >
> > > > I would be in favour of option 2 because it makes things explicit. If
> > we
> > > > don't limit the direct memory, I fear that we might end up in a
> similar
> > > > situation as we are currently in: The user might see that her process
> > > gets
> > > > killed by the OS and does not know why this is the case.
> Consequently,
> > > she
> > > > tries to decrease the process memory size (similar to increasing the
> > > cutoff
> > > > ratio) in order to accommodate for the extra direct memory. Even
> worse,
> > > she
> > > > tries to decrease memory budgets which are not fully used and hence
> > won't
> > > > change the overall memory consumption.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <[hidden email]
> >
> > > > wrote:
> > > >
> > > > > Let me explain this with a concrete example Till.
> > > > >
> > > > > Let's say we have the following scenario.
> > > > >
> > > > > Total Process Memory: 1GB
> > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> Memory
> > > and
> > > > > Network Memory): 800MB
> > > > >
> > > > >
> > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > > value,
> > > > > let's say 1TB.
> > > > >
> > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > Overhead
> > > > > do not exceed 200MB, then alternative 2 and alternative 3 should
> have
> > > the
> > > > > same utility. Setting larger -XX:MaxDirectMemorySize will not
> reduce
> > > the
> > > > > sizes of the other memory pools.
> > > > >
> > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > Overhead potentially exceed 200MB, then
> > > > >
> > > > >    - Alternative 2 suffers from frequent OOM. To avoid that, the
> only
> > > > thing
> > > > >    user can do is to modify the configuration and increase JVM
> Direct
> > > > > Memory
> > > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user
> > increases
> > > > JVM
> > > > >    Direct Memory to 250MB, this will reduce the total size of other
> > > > memory
> > > > >    pools to 750MB, given the total process memory remains 1GB.
> > > > >    - For alternative 3, there is no chance of direct OOM. There are
> > > > chances
> > > > >    of exceeding the total process memory limit, but given that the
> > > > process
> > > > > may
> > > > >    not use up all the reserved native memory (Off-Heap Managed
> > Memory,
> > > > > Network
> > > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > > slightly
> > > > > above
> > > > >    yet very close to 200MB, user probably do not need to change the
> > > > >    configurations.
> > > > >
> > > > > Therefore, I think from the user's perspective, a feasible
> > > configuration
> > > > > for alternative 2 may lead to lower resource utilization compared
> to
> > > > > alternative 3.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
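Xintong's budget example above is plain subtraction; a quick sketch of the arithmetic (treating 1 GB as 1000 MB to match the example's round numbers):

```java
public class ProcessMemoryExample {
    // Remaining budget for heap, metaspace, managed and network memory once
    // the JVM direct memory pool is carved out of the total process memory.
    static long otherPoolsMb(long totalProcessMb, long jvmDirectMb) {
        return totalProcessMb - jvmDirectMb;
    }

    public static void main(String[] args) {
        System.out.println(otherPoolsMb(1000, 200)); // the 800 MB case
        System.out.println(otherPoolsMb(1000, 250)); // after raising direct memory
    }
}
```

Raising the direct pool from 200 MB to 250 MB shrinks the other pools from 800 MB to 750 MB, which is the trade-off described in the example.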
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> [hidden email]
> > >
> > > > > wrote:
> > > > >
> > > > > > I guess you have to help me understand the difference between
> > > > > alternative 2
> > > > > > and 3 wrt to memory under utilization Xintong.
> > > > > >
> > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap
> Memory
> > > and
> > > > > JVM
> > > > > > Overhead. Then there is the risk that this size is too low
> > resulting
> > > > in a
> > > > > > lot of garbage collection and potentially an OOM.
> > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger
> > than
> > > > > > alternative 2. This would of course reduce the sizes of the other
> > > > memory
> > > > > > types.
> > > > > >
> > > > > > How would alternative 2 now result in an under utilization of
> > memory
> > > > > > compared to alternative 3? If alternative 3 strictly sets a
> higher
> > > max
> > > > > > direct memory size and we use only little, then I would expect
> that
> > > > > > alternative 3 results in memory under utilization.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <[hidden email]
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi xintong,till
> > > > > > >
> > > > > > >
> > > > > > > > Native and Direct Memory
> > > > > > >
> > > > > > > My point is setting a very large max direct memory size when we
> > do
> > > > not
> > > > > > > differentiate direct and native memory. If the direct
> > > > memory,including
> > > > > > user
> > > > > > > direct memory and framework direct memory,could be calculated
> > > > > > > correctly,then
> > > > > > > i am in favor of setting direct memory with fixed value.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Memory Calculation
> > > > > > >
> > > > > > > I agree with xintong. For Yarn and k8s,we need to check the
> > memory
> > > > > > > configurations in client to avoid submitting successfully and
> > > failing
> > > > > in
> > > > > > > the flink master.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Yang
> > > > > > >
> > > > > > > Xintong Song <[hidden email]>于2019年8月13日 周二22:07写道:
> > > > > > >
> > > > > > > > Thanks for replying, Till.
> > > > > > > >
> > > > > > > > About MemorySegment, I think you are right that we should not
> > > > include
> > > > > > > this
> > > > > > > > issue in the scope of this FLIP. This FLIP should concentrate
> > on
> > > > how
> > > > > to
> > > > > > > > configure memory pools for TaskExecutors, with minimum
> > > involvement
> > > > on
> > > > > > how
> > > > > > > > memory consumers use it.
> > > > > > > >
> > > > > > > > About direct memory, I think alternative 3 may not have the
> > > > > > > > same over-reservation issue that alternative 2 does, but at the
> > > > > > > > cost of risking memory overuse at the container level, which is
> > > > > > > > not good. My
> > point
> > > is
> > > > > > that
> > > > > > > > both "Task Off-Heap Memory" and "JVM Overhead" are not easy
> to
> > > > > config.
> > > > > > > For
> > > > > > > > alternative 2, users might configure them higher than what
> > > actually
> > > > > > > needed,
> > > > > > > > just to avoid getting a direct OOM. For alternative 3, users
> do
> > > not
> > > > > get
> > > > > > > > direct OOM, so they may not config the two options
> aggressively
> > > > high.
> > > > > > But
> > > > > > > > the consequences are risks of overall container memory usage
> > > > exceeds
> > > > > > the
> > > > > > > > budget.
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > >
> > > > > > > > > All in all I think it already looks quite good. Concerning
> > the
> > > > > first
> > > > > > > open
> > > > > > > > > question about allocating memory segments, I was wondering
> > > > whether
> > > > > > this
> > > > > > > > is
> > > > > > > > > strictly necessary to do in the context of this FLIP or
> > whether
> > > > > this
> > > > > > > > could
> > > > > > > > > be done as a follow up? Without knowing all details, I
> would
> > be
> > > > > > > concerned
> > > > > > > > > that we would widen the scope of this FLIP too much because
> > we
> > > > > would
> > > > > > > have
> > > > > > > > > to touch all the existing call sites of the MemoryManager
> > where
> > > > we
> > > > > > > > allocate
> > > > > > > > > memory segments (this should mainly be batch operators).
> The
> > > > > addition
> > > > > > > of
> > > > > > > > > the memory reservation call to the MemoryManager should not
> > be
> > > > > > affected
> > > > > > > > by
> > > > > > > > > this and I would hope that this is the only point of
> > > interaction
> > > > a
> > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > >
> > > > > > > > > Concerning the second open question about setting or not
> > > setting
> > > > a
> > > > > > max
> > > > > > > > > direct memory limit, I would also be interested why Yang
> Wang
> > > > > thinks
> > > > > > > > > leaving it open would be best. My concern about this would
> be
> > > > that
> > > > > we
> > > > > > > > would
> > > > > > > > > be in a similar situation as we are now with the
> > > > > RocksDBStateBackend.
> > > > > > > If
> > > > > > > > > the different memory pools are not clearly separated and
> can
> > > > spill
> > > > > > over
> > > > > > > > to
> > > > > > > > > a different pool, then it is quite hard to understand what
> > > > exactly
> > > > > > > > causes a
> > > > > > > > > process to get killed for using too much memory. This could
> > > then
> > > > > > easily
> > > > > > > > > lead to a similar situation what we have with the
> > cutoff-ratio.
> > > > So
> > > > > > why
> > > > > > > > not
> > > > > > > > > setting a sane default value for max direct memory and
> giving
> > > the
> > > > > > user
> > > > > > > an
> > > > > > > > > option to increase it if he runs into an OOM.
> > > > > > > > >
> > > > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > > > utilization
> > > > > > than
> > > > > > > > > alternative 3 where we set the direct memory to a higher
> > value?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > >
> > > > > > > > > > Regarding your comments:
> > > > > > > > > >
> > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > I think setting a very large max direct memory size
> > > definitely
> > > > > has
> > > > > > > some
> > > > > > > > > > good sides. E.g., we do not worry about direct OOM, and
> we
> > > > don't
> > > > > > even
> > > > > > > > > need
> > > > > > > > > > to allocate managed / network memory with
> > Unsafe.allocate() .
> > > > > > > > > > However, there are also some down sides of doing this.
> > > > > > > > > >
> > > > > > > > > >    - One thing I can think of is that if a task executor
> > > > > > > > > >    container is killed due to overusing memory, it could be
> > > > > > > > > >    hard for us to know which part of the memory is overused.
> > > > > > > > > >    - Another downside is that the JVM never triggers GC due
> > > > > > > > > >    to reaching the max direct memory limit, because the limit
> > > > > > > > > >    is too high to be reached. That means we kind of rely on
> > > > > > > > > >    heap memory to trigger GC and release direct memory. That
> > > > > > > > > >    could be a problem in cases where we have more direct
> > > > > > > > > >    memory usage but not enough heap activity to trigger the
> > > > > > > > > >    GC.
> > > > > > > > > >
> > > > > > > > > > Maybe you can share your reasons for preferring setting a
> > > > > > > > > > very large value, in case there is anything else I
> > > > > > > > > > overlooked.
> > > > > > > > > >
> > > > > > > > > > *Memory Calculation*
> > > > > > > > > > If there is any conflict between multiple configuration
> > that
> > > > user
> > > > > > > > > > explicitly specified, I think we should throw an error.
> > > > > > > > > > I think doing checking on the client side is a good idea,
> > so
> > > > that
> > > > > > on
> > > > > > > > > Yarn /
> > > > > > > > > > K8s we can discover the problem before submitting the
> Flink
> > > > > > cluster,
> > > > > > > > > which
> > > > > > > > > > is always a good thing.
> > > > > > > > > > But we cannot only rely on the client-side checking,
> > > > > > > > > > because for standalone clusters, TaskManagers on different
> > > > > > > > > > machines may have different configurations and the client
> > > > > > > > > > does not see that.
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi xintong,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your detailed proposal. After all the memory
> > > > > > > configuration
> > > > > > > > > are
> > > > > > > > > > > introduced, it will be more powerful to control the
> flink
> > > > > memory
> > > > > > > > > usage. I
> > > > > > > > > > > just have few questions about it.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > >
> > > > > > > > > > > We do not differentiate user direct memory and native
> > > > > > > > > > > memory. They are all included in task off-heap memory.
> > > > > > > > > > > Right? So I don't think we could set the
> > > > > > > > > > > -XX:MaxDirectMemorySize properly. I prefer leaving it a
> > > > > > > > > > > very large value.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > >
> > > > > > > > > > > If the sum of the fine-grained memory (network memory,
> > > > > > > > > > > managed memory, etc.) is larger than the total process
> > > > > > > > > > > memory, how do we deal with this situation? Do we need to
> > > > > > > > > > > check the memory configuration on the client?
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song <[hidden email]> 于2019年8月7日周三
> > > 下午10:14写道:
> > > > > > > > > > >


Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Till Rohrmann
Quick question for option 1.1, Stephan: Does this variant entail that we
distinguish between native and direct memory within off-heap managed memory? If
this is the case, then it won't be possible for users to run a streaming
job using RocksDB and a batch DataSet job on the same session cluster
unless they have configured the off-heap managed memory split twofold (e.g.
50% native, 50% direct memory).
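Whether such a split is needed hinges on the difference between the two kinds of off-heap memory discussed in this thread. A minimal sketch of that difference (not Flink code; the reflective `sun.misc.Unsafe` access is an assumption that holds on common JDK setups): memory from `Unsafe.allocateMemory()` is native and invisible to -XX:MaxDirectMemorySize, while direct ByteBuffers count against it.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

public class OffHeapKinds {
    // A direct ByteBuffer: counted against -XX:MaxDirectMemorySize and only
    // released once the buffer object is garbage collected.
    static ByteBuffer allocateDirect(int bytes) {
        return ByteBuffer.allocateDirect(bytes);
    }

    // Raw native memory via sun.misc.Unsafe (obtained reflectively): NOT
    // counted against the direct memory limit and freed manually, no GC.
    static long allocateAndFreeNative(long bytes) throws Exception {
        Class<?> cls = Class.forName("sun.misc.Unsafe");
        Field f = cls.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Object unsafe = f.get(null);
        Method alloc = cls.getMethod("allocateMemory", long.class);
        Method free = cls.getMethod("freeMemory", long.class);
        long address = (Long) alloc.invoke(unsafe, bytes);
        free.invoke(unsafe, address); // manual release, no GC involved
        return address;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(allocateDirect(1 << 20).capacity());
        System.out.println(allocateAndFreeNative(1 << 20) != 0);
    }
}
```

Under option 1.1 only the ByteBuffer-style allocations would need to fit under the configured direct memory limit; the Unsafe-style allocations would not.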

Cheers,
Till

On Mon, Aug 19, 2019 at 4:21 PM JingsongLee <[hidden email]>
wrote:

> Hi stephan:
>
> About option 2:
>
> If additional threads are not cleanly shut down before we can exit the task:
> In the current case of memory reuse, the exited task has already freed the
>  memory it used. If this memory is used by other tasks while asynchronous
>  threads of the exited task may still be writing, there will be concurrency
>  problems, and these can even lead to errors in users' computing results.
>
> So I think this is a serious and intolerable bug; no matter what the
>  option is, it should be avoided.
>
> About direct memory being cleaned by GC:
> I don't think it is a good idea. I've encountered many situations
>  where GC came too late and caused a DirectMemory OOM. When DirectMemory
>  is released and allocated depends on the type of user job, which is
>  often beyond our control.
>
> Best,
> Jingsong Lee
>
>
> ------------------------------------------------------------------
> From:Stephan Ewen <[hidden email]>
> Send Time:2019年8月19日(星期一) 15:56
> To:dev <[hidden email]>
> Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> TaskExecutors
>
> My main concern with option 2 (manually release memory) is that segfaults
> in the JVM send off all sorts of alarms on user ends. So we need to
> guarantee that this never happens.
>
> The trickiness is in tasks that use data structures / algorithms with
> additional threads, like hash table spill/read and sorting threads. We need
> to ensure that these cleanly shut down before we can exit the task.
> I am not sure that we have that guaranteed already, that's why option 1.1
> seemed simpler to me.
>
> On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <[hidden email]>
> wrote:
>
> > Thanks for the comments, Stephan. Summarized in this way really makes
> > things easier to understand.
> >
> > I'm in favor of option 2, at least for the moment. I think it is not that
> > difficult to keep the memory manager segfault safe, as long as we always
> > de-allocate the memory segment when it is released by the memory
> > consumers. A segfault can only happen if a memory consumer continues
> > using the buffer of a memory segment after releasing it, in which case we
> > do want the job to fail so that we detect the memory leak early.
> >
> > For option 1.2, I don't think this is a good idea. Not only because the
> > assumption (regular GC is enough to clean direct buffers) may not always be
> > true, but also because it makes it harder to find problems in cases of
> > memory overuse. E.g., say the user configured some direct memory for the
> > user libraries. If a library actually uses more direct memory than
> > configured, and those buffers cannot be cleaned by GC because they are
> > still in use, this may lead to overuse of the total container memory. In
> > that case, if it didn't touch the JVM default max direct memory limit, we
> > cannot get a direct memory OOM and it becomes super hard to understand
> > which part of the configuration needs to be updated.
> >
> > Option 1.1 has a similar problem to 1.2 if the excess direct memory
> > does not reach the max direct memory limit specified by the dedicated
> > parameter. I think it is slightly better than 1.2, only because we can
> > tune the parameter.
> >
> > Thank you~
> >
> > Xintong Song
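The de-allocation discipline described above (free on release, fail fast on use-after-release) can be sketched roughly like this; the class name and the on-heap stand-in for the off-heap address are made up for illustration and are not Flink code:

```java
public class ReleasableSegment {
    private byte[] buffer; // on-heap stand-in for the native memory address

    ReleasableSegment(int size) {
        this.buffer = new byte[size];
    }

    // Accessing a released segment fails immediately, surfacing the leak
    // early instead of silently corrupting memory reused by another task.
    byte get(int index) {
        if (buffer == null) {
            throw new IllegalStateException("segment used after release");
        }
        return buffer[index];
    }

    // Called when the consumer releases the segment; in the real off-heap
    // case this is where the native memory would be freed exactly once.
    void release() {
        buffer = null;
    }
}
```

With this shape, a consumer that keeps using a segment after releasing it gets an exception (a failed job) rather than a segfault or wrong results.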
> >
> >
> >
> > On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <[hidden email]> wrote:
> >
> > > About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize
> > it a
> > > bit differently:
> > >
> > > We have the following two options:
> > >
> > > (1) We let MemorySegments be de-allocated by the GC. That makes it
> > > segfault safe. But then we need a way to trigger GC in case de-allocation
> > > and re-allocation of a bunch of segments happens quickly, which is often
> > > the case during batch scheduling or task restart.
> > >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> > >   - Another way could be to have a dedicated bookkeeping in the
> > > MemoryManager (option 1.2), so that this is a number independent of the
> > > "-XX:MaxDirectMemorySize" parameter.
> > >
> > > (2) We manually allocate and de-allocate the memory for the
> > > MemorySegments (option 2). That way we need not worry about triggering GC
> > > by some threshold or bookkeeping, but it is harder to prevent segfaults.
> > > We need to be very careful about when we release the memory segments
> > > (only in the cleanup phase of the main thread).
> > >
> > > If we go with option 1.1, we probably need to set
> > > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory"
> > > and have "direct_memory" as a separate reserved memory pool. Because if
> > > we just set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> > > jvm_overhead", then there will be times when that entire memory is
> > > allocated by direct buffers and we have nothing left for the JVM
> > > overhead. So we either need a way to compensate for that (again some
> > > safety margin cutoff value) or we will exceed container memory.
> > >
> > > If we go with option 1.2, we need to be aware that it takes elaborate
> > logic
> > > to push recycling of direct buffers without always triggering a full
> GC.
> > >
> > >
> > > My first guess is that the options will be easiest to do in the
> > > following order:
> > >
> > >   - Option 1.1 with a dedicated direct_memory parameter, as discussed
> > > above. We would need to find a way to set the direct_memory parameter by
> > > default. We could start with 64 MB and see how it goes in practice. One
> > > danger I see is that setting this too low can cause a bunch of additional
> > > GCs compared to before (we need to watch this carefully).
> > >
> > >   - Option 2. It is actually quite simple to implement; we could try how
> > > segfault safe we are at the moment.
> > >
> > >   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> > > parameter at all and assume that all the direct memory allocations that
> > > the JVM and Netty do are infrequent enough to be cleaned up fast enough
> > > through regular GC. I am not sure if that is a valid assumption, though.
> > >
> > > Best,
> > > Stephan
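Stephan's option 1.1 above amounts to a simple budget computation for the JVM flag; here is a hedged sketch with made-up example sizes (only the 64 MB starting default for direct_memory comes from the message above):

```java
public class MaxDirectMemoryFlag {
    // Option 1.1: -XX:MaxDirectMemorySize covers off-heap managed memory plus
    // a dedicated direct_memory pool, so direct buffers can never eat into
    // the budget reserved for JVM overhead.
    static String maxDirectFlag(long offHeapManagedMb, long directMemoryMb) {
        return "-XX:MaxDirectMemorySize=" + (offHeapManagedMb + directMemoryMb) + "m";
    }

    public static void main(String[] args) {
        // 512 MB of off-heap managed memory is an assumed example value;
        // 64 MB is the proposed starting default for direct_memory.
        System.out.println(maxDirectFlag(512, 64));
    }
}
```

For these example sizes this prints -XX:MaxDirectMemorySize=576m; hitting that limit is what triggers the GC that recycles de-allocated direct buffers.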
> > >
> > >
> > >
> > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for sharing your opinion Till.
> > > >
> > > > I'm also in favor of alternative 2. I was wondering whether we can
> > avoid
> > > > using Unsafe.allocate() for off-heap managed memory and network
> memory
> > > with
> > > > alternative 3. But after giving it a second thought, I think even for
> > > > alternative 3 using direct memory for off-heap managed memory could
> > cause
> > > > problems.
> > > >
> > > > Hi Yang,
> > > >
> > > > Regarding your concern, I think what is proposed in this FLIP is to
> > > > have both off-heap managed memory and network memory allocated through
> > > > Unsafe.allocate(), which means they are practically native memory and
> > > > not limited by JVM max direct memory. The only parts of memory limited
> > > > by JVM max direct memory are task off-heap memory and JVM overhead,
> > > > which is exactly what alternative 2 suggests to set the JVM max direct
> > > > memory to.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> > > > wrote:
> > > >
> > > > > Thanks for the clarification Xintong. I understand the two
> > alternatives
> > > > > now.
> > > > >
> > > > > I would be in favour of option 2 because it makes things explicit.
> If
> > > we
> > > > > don't limit the direct memory, I fear that we might end up in a
> > similar
> > > > > situation as we are currently in: The user might see that her
> process
> > > > gets
> > > > > killed by the OS and does not know why this is the case.
> > Consequently,
> > > > she
> > > > > tries to decrease the process memory size (similar to increasing
> the
> > > > cutoff
> > > > > ratio) in order to accommodate for the extra direct memory. Even
> > worse,
> > > > she
> > > > > tries to decrease memory budgets which are not fully used and hence
> > > won't
> > > > > change the overall memory consumption.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> [hidden email]
> > >
> > > > > wrote:
> > > > >
> > > > > > Let me explain this with a concrete example Till.
> > > > > >
> > > > > > Let's say we have the following scenario.
> > > > > >
> > > > > > Total Process Memory: 1GB
> > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> > Memory
> > > > and
> > > > > > Network Memory): 800MB
> > > > > >
> > > > > >
> > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > > > value,
> > > > > > let's say 1TB.
> > > > > >
> > > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > Overhead
> > > > > > do not exceed 200MB, then alternative 2 and alternative 3 should
> > have
> > > > the
> > > > > > same utility. Setting larger -XX:MaxDirectMemorySize will not
> > reduce
> > > > the
> > > > > > sizes of the other memory pools.
> > > > > >
> > > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > > Overhead potentially exceed 200MB, then
> > > > > >
> > > > > >    - Alternative 2 suffers from frequent OOM. To avoid that, the
> > only
> > > > > thing
> > > > > >    user can do is to modify the configuration and increase JVM
> > Direct
> > > > > > Memory
> > > > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user
> > > increases
> > > > > JVM
> > > > > >    Direct Memory to 250MB, this will reduce the total size of
> other
> > > > > memory
> > > > > >    pools to 750MB, given the total process memory remains 1GB.
> > > > > >    - For alternative 3, there is no chance of direct OOM. There
> are
> > > > > chances
> > > > > >    of exceeding the total process memory limit, but given that
> the
> > > > > process
> > > > > > may
> > > > > >    not use up all the reserved native memory (Off-Heap Managed
> > > Memory,
> > > > > > Network
> > > > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > > > slightly
> > > > > > above
> > > > > >    yet very close to 200MB, user probably do not need to change
> the
> > > > > >    configurations.
> > > > > >
> > > > > > Therefore, I think from the user's perspective, a feasible
> > > > configuration
> > > > > > for alternative 2 may lead to lower resource utilization compared
> > to
> > > > > > alternative 3.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > >
> > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for sharing your opinion Till.
> > > >
> > > > I'm also in favor of alternative 2. I was wondering whether we can
> > avoid
> > > > using Unsafe.allocate() for off-heap managed memory and network
> memory
> > > with
> > > > alternative 3. But after giving it a second thought, I think even for
> > > > alternative 3 using direct memory for off-heap managed memory could
> > cause
> > > > problems.
> > > >
> > > > Hi Yang,
> > > >
> > > > Regarding your concern, I think what is proposed in this FLIP is to have
> > > both
> > > > off-heap managed memory and network memory allocated through
> > > > Unsafe.allocate(), which means they are practically native memory and
> > not
> > > > limited by JVM max direct memory. The only parts of memory limited by
> > JVM
> > > > max direct memory are task off-heap memory and JVM overhead, which
> are
> > > > exactly what alternative 2 suggests setting the JVM max direct memory to.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> > > > wrote:
> > > >
> > > > > Thanks for the clarification Xintong. I understand the two
> > alternatives
> > > > > now.
> > > > >
> > > > > I would be in favour of option 2 because it makes things explicit.
> If
> > > we
> > > > > don't limit the direct memory, I fear that we might end up in a
> > similar
> > > > > situation as we are currently in: The user might see that her
> process
> > > > gets
> > > > > killed by the OS and does not know why this is the case.
> > Consequently,
> > > > she
> > > > > tries to decrease the process memory size (similar to increasing
> the
> > > > cutoff
> > > > > ratio) in order to accommodate the extra direct memory. Even
> > worse,
> > > > she
> > > > > tries to decrease memory budgets which are not fully used and hence
> > > won't
> > > > > change the overall memory consumption.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> [hidden email]
> > >
> > > > > wrote:
> > > > >
> > > > > > Let me explain this with a concrete example Till.
> > > > > >
> > > > > > Let's say we have the following scenario.
> > > > > >
> > > > > > Total Process Memory: 1GB
> > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> > Memory
> > > > and
> > > > > > Network Memory): 800MB
> > > > > >
> > > > > >
> > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > > > value,
> > > > > > let's say 1TB.
> > > > > >
> > > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > Overhead
> > > > > > do not exceed 200MB, then alternative 2 and alternative 3 should
> > have
> > > > the
> > > > > > same utility. Setting a larger -XX:MaxDirectMemorySize will not
> > reduce
> > > > the
> > > > > > sizes of the other memory pools.
> > > > > >
> > > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > > Overhead potentially exceed 200MB, then
> > > > > >
> > > > > >    - Alternative 2 suffers from frequent OOM. To avoid that, the
> > only
> > > > > thing
> > > > > >    the user can do is to modify the configuration and increase JVM
> > Direct
> > > > > > Memory
> > > > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user
> > > increases
> > > > > JVM
> > > > > >    Direct Memory to 250MB, this will reduce the total size of
> other
> > > > > memory
> > > > > >    pools to 750MB, given the total process memory remains 1GB.
> > > > > >    - For alternative 3, there is no chance of direct OOM. There
> are
> > > > > chances
> > > > > >    of exceeding the total process memory limit, but given that
> the
> > > > > process
> > > > > > may
> > > > > >    not use up all the reserved native memory (Off-Heap Managed
> > > Memory,
> > > > > > Network
> > > > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > > > slightly
> > > > > > above
> > > > > >    yet very close to 200MB, the user probably does not need to change
> the
> > > > > >    configurations.
> > > > > >
> > > > > > Therefore, I think from the user's perspective, a feasible
> > > > configuration
> > > > > > for alternative 2 may lead to lower resource utilization compared
> > to
> > > > > > alternative 3.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
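[Editor's note] To make the arithmetic in Xintong's example concrete, here is a minimal Java sketch of how the two alternatives would translate into JVM flags. All class and method names are invented for illustration; this is not Flink's actual startup code, and the `1024g` value for alternative 3 is just an arbitrary "very large" placeholder.

```java
// Hypothetical sketch of the two -XX:MaxDirectMemorySize alternatives
// discussed above, using the pool sizes from the 1 GB example.
public class MaxDirectMemoryExample {
    static final long MB = 1024 * 1024;

    // Pools from the example (total process memory = 1 GB).
    static final long TASK_OFF_HEAP = 100 * MB;
    static final long JVM_OVERHEAD  = 100 * MB;
    static final long OTHER         = 800 * MB; // heap, metaspace, managed, network

    /** Alternative 2: cap direct memory at exactly the budgeted pools. */
    static String alternative2() {
        return "-XX:MaxDirectMemorySize=" + (TASK_OFF_HEAP + JVM_OVERHEAD) / MB + "m";
    }

    /** Alternative 3: an effectively unlimited direct memory cap. */
    static String alternative3() {
        return "-XX:MaxDirectMemorySize=1024g";
    }

    public static void main(String[] args) {
        System.out.println(alternative2()); // -XX:MaxDirectMemorySize=200m
        System.out.println(alternative3()); // -XX:MaxDirectMemorySize=1024g
        // Note: enlarging the cap in alternative 3 does not shrink the other
        // pools; the risk is instead that total process usage exceeds 1 GB.
        System.out.println("other pools = " + OTHER / MB + "m in both cases");
    }
}
```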
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I guess you have to help me understand the difference between
> > > > > > alternative 2
> > > > > > > and 3 wrt to memory under utilization Xintong.
> > > > > > >
> > > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap
> > Memory
> > > > and
> > > > > > JVM
> > > > > > > Overhead. Then there is the risk that this size is too low
> > > resulting
> > > > > in a
> > > > > > > lot of garbage collection and potentially an OOM.
> > > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger
> > > than
> > > > > > > alternative 2. This would of course reduce the sizes of the
> other
> > > > > memory
> > > > > > > types.
> > > > > > >
> > > > > > > How would alternative 2 now result in an under utilization of
> > > memory
> > > > > > > compared to alternative 3? If alternative 3 strictly sets a
> > higher
> > > > max
> > > > > > > direct memory size and we use only little, then I would expect
> > that
> > > > > > > alternative 3 results in memory under utilization.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <
> [hidden email]
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Xintong, Till,
> > > > > > > >
> > > > > > > >
> > > > > > > > > Native and Direct Memory
> > > > > > > >
> > > > > > > > My point is to set a very large max direct memory size when we do
> > > > > > > > not differentiate direct and native memory. If the direct memory,
> > > > > > > > including user direct memory and framework direct memory, could be
> > > > > > > > calculated correctly, then I am in favor of setting direct memory
> > > > > > > > to a fixed value.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Memory Calculation
> > > > > > > >
> > > > > > > > I agree with Xintong. For Yarn and K8s, we need to check the memory
> > > > > > > > configurations on the client side, to avoid the submission
> > > > > > > > succeeding and then failing in the Flink master.
> > > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > >
> > > > > > > > Yang
> > > > > > > >
> > > > > > > > Xintong Song <[hidden email]>于2019年8月13日 周二22:07写道:
> > > > > > > >
> > > > > > > > > Thanks for replying, Till.
> > > > > > > > >
> > > > > > > > > About MemorySegment, I think you are right that we should
> not
> > > > > include
> > > > > > > > this
> > > > > > > > > issue in the scope of this FLIP. This FLIP should
> concentrate
> > > on
> > > > > how
> > > > > > to
> > > > > > > > > configure memory pools for TaskExecutors, with minimum
> > > > involvement
> > > > > on
> > > > > > > how
> > > > > > > > > memory consumers use it.
> > > > > > > > >
> > > > > > > > > About direct memory, I think alternative 3 may not have the same
> > > > > > > > > over-reservation issue that alternative 2 does, but at the cost
> > > > > > > > > of risking memory overuse at the container level, which is not
> > > > > > > > > good. My point is that both "Task Off-Heap Memory" and "JVM
> > > > > > > > > Overhead" are not easy to configure. For alternative 2, users
> > > > > > > > > might configure them higher than what is actually needed, just
> > > > > > > > > to avoid getting a direct OOM. For alternative 3, users do not
> > > > > > > > > get direct OOM, so they may not configure the two options
> > > > > > > > > aggressively high; but the consequence is the risk that the
> > > > > > > > > overall container memory usage exceeds the budget.
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > > >
> > > > > > > > > > All in all I think it already looks quite good.
> Concerning
> > > the
> > > > > > first
> > > > > > > > open
> > > > > > > > > > question about allocating memory segments, I was
> wondering
> > > > > whether
> > > > > > > this
> > > > > > > > > is
> > > > > > > > > > strictly necessary to do in the context of this FLIP or
> > > whether
> > > > > > this
> > > > > > > > > could
> > > > > > > > > > be done as a follow up? Without knowing all details, I
> > would
> > > be
> > > > > > > > concerned
> > > > > > > > > > that we would widen the scope of this FLIP too much
> because
> > > we
> > > > > > would
> > > > > > > > have
> > > > > > > > > > to touch all the existing call sites of the MemoryManager
> > > where
> > > > > we
> > > > > > > > > allocate
> > > > > > > > > > memory segments (this should mainly be batch operators).
> > The
> > > > > > addition
> > > > > > > > of
> > > > > > > > > > the memory reservation call to the MemoryManager should
> not
> > > be
> > > > > > > affected
> > > > > > > > > by
> > > > > > > > > > this and I would hope that this is the only point of
> > > > interaction
> > > > > a
> > > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > > >
> > > > > > > > > > Concerning the second open question about setting or not
> > > > setting
> > > > > a
> > > > > > > max
> > > > > > > > > > direct memory limit, I would also be interested why Yang
> > Wang
> > > > > > thinks
> > > > > > > > > > leaving it open would be best. My concern about this
> would
> > be
> > > > > that
> > > > > > we
> > > > > > > > > would
> > > > > > > > > > be in a similar situation as we are now with the
> > > > > > RocksDBStateBackend.
> > > > > > > > If
> > > > > > > > > > the different memory pools are not clearly separated and
> > can
> > > > > spill
> > > > > > > over
> > > > > > > > > to
> > > > > > > > > > a different pool, then it is quite hard to understand
> what
> > > > > exactly
> > > > > > > > > causes a
> > > > > > > > > > process to get killed for using too much memory. This
> could
> > > > then
> > > > > > > easily
> > > > > > > > > > lead to a similar situation to what we have with the
> > > > > > > > > > cutoff-ratio. So why not set a sane default value for max
> > > > > > > > > > direct memory and give the user an option to increase it if
> > > > > > > > > > they run into an OOM?
> > > > > > > > > >
> > > > > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > > > > utilization
> > > > > > > than
> > > > > > > > > > alternative 3 where we set the direct memory to a higher
> > > value?
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Till
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > > [hidden email]
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > > >
> > > > > > > > > > > Regarding your comments:
> > > > > > > > > > >
> > > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > > I think setting a very large max direct memory size
> > > > definitely
> > > > > > has
> > > > > > > > some
> > > > > > > > > > > good sides. E.g., we do not worry about direct OOM, and
> > we
> > > > > don't
> > > > > > > even
> > > > > > > > > > need
> > > > > > > > > > > to allocate managed / network memory with
> > > Unsafe.allocate().
> > > > > > > > > > > However, there are also some downsides of doing this.
> > > > > > > > > > >
> > > > > > > > > > >    - One thing I can think of is that if a task
> executor
> > > > > > container
> > > > > > > is
> > > > > > > > > > >    killed due to overusing memory, it could be hard for us
> > > > > > > > > > >    to know which part of the memory is overused.
> > > > > > > > > > >    - Another downside is that the JVM never triggers GC due
> > > > > > > > > > >    to reaching the max direct memory limit, because the limit
> > > > > > > > > > >    is too high to be reached. That means we kind of rely on
> > > > > > > > > > >    heap memory pressure to trigger GC and release direct
> > > > > > > > > > >    memory. That could be a problem in cases where we have
> > > > > > > > > > >    more direct memory usage but not enough heap activity to
> > > > > > > > > > >    trigger the GC.
> > > > > > > > > > >
> > > > > > > > > > > Maybe you can share your reasons for preferring to set a very
> > > > > > > > > > > large value, in case there is anything else I overlooked.
> > > > > > > > > > >
> > > > > > > > > > > *Memory Calculation*
> > > > > > > > > > > If there is any conflict between multiple configurations
> > > > > > > > > > > that the user explicitly specified, I think we should throw
> > > > > > > > > > > an error.
> > > > > > > > > > > I think doing the check on the client side is a good idea,
> > > > > > > > > > > so that on Yarn / K8s we can discover the problem before
> > > > > > > > > > > submitting the Flink cluster, which is always a good thing.
> > > > > > > > > > > But we cannot rely only on the client-side checking, because
> > > > > > > > > > > for standalone clusters, TaskManagers on different machines
> > > > > > > > > > > may have different configurations which the client does not
> > > > > > > > > > > see.
> > > > > > > > > > > What do you think?
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
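[Editor's note] A minimal sketch of the client-side check discussed above, assuming a simple fail-fast policy when the explicitly configured pools cannot fit into the total process memory. Class and method names are hypothetical, invented for illustration; this is not Flink's actual validation code.

```java
// Hypothetical fail-fast check: reject a configuration whose fine-grained
// pools (network memory, managed memory, etc.) sum to more than the total
// process memory, instead of submitting and failing in the Flink master.
public class MemoryConfigCheck {
    static void validate(long totalProcessBytes, long... poolBytes) {
        long sum = 0;
        for (long p : poolBytes) {
            sum += p;
        }
        if (sum > totalProcessBytes) {
            throw new IllegalArgumentException(
                "Sum of configured memory pools (" + sum
                    + " bytes) exceeds total process memory ("
                    + totalProcessBytes + " bytes)");
        }
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        validate(1024 * mb, 200 * mb, 800 * mb);      // OK: 1000 MB <= 1024 MB
        try {
            validate(1024 * mb, 400 * mb, 800 * mb);  // 1200 MB > 1024 MB
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The same check could run again on each TaskManager at startup, covering the standalone case where the client cannot see per-machine configurations.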
> > > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi xintong,
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for your detailed proposal. After all the
> memory
> > > > > > > > configuration
> > > > > > > > > > are
> > > > > > > > > > > > introduced, it will be more powerful to control the
> > flink
> > > > > > memory
> > > > > > > > > > usage. I
> > > > > > > > > > > > just have few questions about it.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > > >
> > > > > > > > > > > > We do not differentiate user direct memory and native
> > > > > > > > > > > > memory. They are all included in task off-heap memory.
> > > > > > > > > > > > Right? So I don't think we could set the
> > > > > > > > > > > > -XX:MaxDirectMemorySize properly. I prefer leaving it at a
> > > > > > > > > > > > very large value.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > > >
> > > > > > > > > > > > If the sum of the fine-grained memory pools (network
> > > > > > > > > > > > memory, managed memory, etc.) is larger than the total
> > > > > > > > > > > > process memory, how do we deal with this situation? Do we
> > > > > > > > > > > > need to check the memory configuration in the client?
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song <[hidden email]> 于2019年8月7日周三
> > > > 下午10:14写道:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > >
> > > > > > > > > > > > > We would like to start a discussion thread on
> > "FLIP-49:
> > > > > > Unified
> > > > > > > > > > Memory
> > > > > > > > > > > > > Configuration for TaskExecutors"[1], where we
> > describe
> > > > how
> > > > > to
> > > > > > > > > improve
> > > > > > > > > > > > > TaskExecutor memory configurations. The FLIP
> document
> > > is
> > > > > > mostly
> > > > > > > > > based
> > > > > > > > > > > on
> > > > > > > > > > > > an
> > > > > > > > > > > > > early design "Memory Management and Configuration
> > > > > > Reloaded"[2]
> > > > > > > by
> > > > > > > > > > > > Stephan,
> > > > > > > > > > > > > with updates from follow-up discussions both online
> > and
> > > > > > > offline.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This FLIP addresses several shortcomings of current
> > > > (Flink
> > > > > > 1.9)
> > > > > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Different configuration for Streaming and
> Batch.
> > > > > > > > > > > > >    - Complex and difficult configuration of RocksDB
> > in
> > > > > > > Streaming.
> > > > > > > > > > > > >    - Complicated, uncertain and hard to understand.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Key changes to solve the problems can be summarized
> > as
> > > > > > follows.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Extend memory manager to also account for
> memory
> > > > usage
> > > > > > by
> > > > > > > > > state
> > > > > > > > > > > > >    backends.
> > > > > > > > > > > > >    - Modify how TaskExecutor memory is partitioned and
> > > > > > > > > > > > >    accounted into individual memory reservations and
> > > > > > > > > > > > >    pools.
> > > > > > > > > > > > >    - Simplify memory configuration options and
> > > > > > > > > > > > >    calculation logic.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please find more details in the FLIP wiki document
> > [1].
> > > > > > > > > > > > >
> > > > > > > > > > > > > (Please note that the early design doc [2] is out
> of
> > > > sync,
> > > > > > and
> > > > > > > it
> > > > > > > > > is
> > > > > > > > > > > > > appreciated to have the discussion in this mailing
> > list
> > > > > > > thread.)
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Looking forward to your feedbacks.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > > > > >
> > > > > > > > > > > > > [2]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Xintong Song
In reply to this post by JingsongLee-2
Thanks for the inputs, Jingsong.

Let me try to summarize your points. Please correct me if I'm wrong.

   - Memory consumers should always avoid returning memory segments to the
   memory manager while there are still un-cleaned structures / threads that
   may use the memory. Otherwise, it would cause serious problems by having
   multiple consumers trying to use the same memory segment.
   - The JVM does not wait for GC when allocating direct memory buffers.
   Therefore, even if we set a proper max direct memory size limit, we may
   still encounter a direct memory OOM if GC cleans up memory more slowly
   than direct memory is allocated.

Am I understanding this correctly?

Thank you~

Xintong Song



On Mon, Aug 19, 2019 at 4:21 PM JingsongLee <[hidden email]>
wrote:

> Hi stephan:
>
> About option 2:
>
> If additional threads are not cleanly shut down before the task exits:
> with the current memory reuse, the task has already freed up the memory it
>  uses. If this memory is then handed to other tasks while asynchronous
>  threads of the exited task may still be writing to it, there will be
>  concurrency safety problems, and even errors in user computation results.
>
> So I think this is a serious and intolerable bug; no matter which
>  option is chosen, it should be avoided.
>
> About direct memory being cleaned by GC:
> I don't think it is a good idea. I've encountered many situations
>  where GC came too late, causing a DirectMemory OOM. Releasing and
>  allocating DirectMemory depends on the type of user job, which is
>  often beyond our control.
>
> Best,
> Jingsong Lee
>
>
> ------------------------------------------------------------------
> From: Stephan Ewen <[hidden email]>
> Send Time: Monday, August 19, 2019, 15:56
> To: dev <[hidden email]>
> Subject: Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> TaskExecutors
>
> My main concern with option 2 (manually release memory) is that segfaults
> in the JVM send off all sorts of alarms on user ends. So we need to
> guarantee that this never happens.
>
> The trickiness is in tasks that use data structures / algorithms with
> additional threads, like hash table spill/read and sorting threads. We need
> to ensure that these cleanly shut down before we can exit the task.
> I am not sure that we have that guaranteed already, that's why option 1.1
> seemed simpler to me.
>
> On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <[hidden email]>
> wrote:
>
> > Thanks for the comments, Stephan. Summarized in this way really makes
> > things easier to understand.
> >
> > I'm in favor of option 2, at least for the moment. I think it is not that
> > difficult to keep the memory manager segfault safe, as long as we always
> > de-allocate a memory segment only once it has been released by the memory
> > consumers. A segfault should only occur if a memory consumer continues
> > using the buffer of a memory segment after releasing it, in which case we
> > do want the job to fail so we
> > detect the memory leak early.
> >
> > For option 1.2, I don't think this is a good idea. Not only because the
> > assumption (regular GC is enough to clean direct buffers) may not always
> > be true, but also because it makes it harder to find problems in cases of
> > memory overuse. E.g., the user configures some direct memory for the user
> > libraries. If a library actually uses more direct memory than configured,
> > which cannot be cleaned by GC because it is still in use, this may lead to
> > overuse of the total container memory. In that case, if the usage does not
> > hit the JVM default max direct memory limit, we do not get a direct memory
> > OOM, and it becomes very hard to understand which part of the
> > configuration needs to be updated.
> >
> > For option 1.1, it has a similar problem to 1.2 if the excess direct
> > memory does not reach the max direct memory limit specified by the
> > dedicated parameter. I think it is slightly better than 1.2, only because
> > we can tune the parameter.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <[hidden email]> wrote:
> >
> > > About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize
> > it a
> > > bit differently:
> > >
> > > We have the following two options:
> > >
> > > (1) We let MemorySegments be de-allocated by the GC. That makes it
> > segfault
> > > safe. But then we need a way to trigger GC in case de-allocation and
> > > re-allocation of a bunch of segments happens quickly, which is often
> the
> > > case during batch scheduling or task restart.
> > >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> > >   - Another way could be to have a dedicated bookkeeping in the
> > > MemoryManager (option 1.2), so that this is a number independent of the
> > > "-XX:MaxDirectMemorySize" parameter.
> > >
> > > (2) We manually allocate and de-allocate the memory for the
> > MemorySegments
> > > (option 2). That way we need not worry about triggering GC by some
> > > threshold or bookkeeping, but it is harder to prevent segfaults. We
> need
> > to
> > > be very careful about when we release the memory segments (only in the
> > > cleanup phase of the main thread).
> > >
> > > If we go with option 1.1, we probably need to set
> > > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory"
> > and
> > > have "direct_memory" as a separate reserved memory pool. Because if we
> > just
> > > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> > jvm_overhead",
> > > then there will be times when that entire memory is allocated by direct
> > > buffers and we have nothing left for the JVM overhead. So we either
> need
> > a
> > > way to compensate for that (again some safety margin cutoff value) or
> we
> > > will exceed container memory.
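The budgeting described above for option 1.1 can be sketched as follows (a rough Python illustration of the arithmetic, not Flink code; the pool names and MB sizes are made-up examples):

```python
# Option 1.1 as described above: -XX:MaxDirectMemorySize covers off-heap
# managed memory plus a dedicated direct_memory pool, so direct buffers
# can never temporarily occupy the budget reserved for JVM overhead.
def max_direct_memory_option_1_1(off_heap_managed_mb, direct_memory_mb):
    return off_heap_managed_mb + direct_memory_mb

# The rejected variant: lumping jvm_overhead into the limit means there
# are times when direct buffers fill the entire limit and nothing is
# left for the JVM overhead itself, risking exceeding container memory.
def max_direct_memory_naive(off_heap_managed_mb, jvm_overhead_mb):
    return off_heap_managed_mb + jvm_overhead_mb

print(max_direct_memory_option_1_1(512, 64))   # 576
print(max_direct_memory_naive(512, 128))       # 640
```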
> > >
> > > If we go with option 1.2, we need to be aware that it takes elaborate
> > logic
> > > to push recycling of direct buffers without always triggering a full
> GC.
> > >
> > >
> > > My first guess is that the options will be easiest to do in the
> following
> > > order:
> > >
> > >   - Option 1.1 with a dedicated direct_memory parameter, as discussed
> > > above. We would need to find a way to set the direct_memory parameter
> by
> > > default. We could start with 64 MB and see how it goes in practice. One
> > > danger I see is that setting this too low can cause a bunch of
> additional
> > > GCs compared to before (we need to watch this carefully).
> > >
> > >   - Option 2. It is actually quite simple to implement, we could try
> how
> > > segfault safe we are at the moment.
> > >
> > >   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> > parameter
> > > at all and assume that all the direct memory allocations that the JVM
> and
> > > Netty do are infrequent enough to be cleaned up fast enough through
> > regular
> > > GC. I am not sure if that is a valid assumption, though.
> > >
> > > Best,
> > > Stephan
> > >
> > >
> > >
> > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for sharing your opinion Till.
> > > >
> > > > I'm also in favor of alternative 2. I was wondering whether we can
> > avoid
> > > > using Unsafe.allocate() for off-heap managed memory and network
> memory
> > > with
> > > > alternative 3. But after giving it a second thought, I think even for
> > > > alternative 3 using direct memory for off-heap managed memory could
> > cause
> > > > problems.
> > > >
> > > > Hi Yang,
> > > >
> > > > Regarding your concern, I think what is proposed in this FLIP is to
> > > > have both off-heap managed memory and network memory allocated through
> > > > Unsafe.allocate(), which means they are practically native memory and
> > > > not limited by the JVM max direct memory. The only parts of memory
> > > > limited by the JVM max direct memory are task off-heap memory and JVM
> > > > overhead, which is exactly what alternative 2 suggests setting the JVM
> > > > max direct memory to.
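The split between pools counted toward the JVM direct memory limit and native pools can be sketched like this (an illustrative Python summary; the MB sizes are invented, only the counted/not-counted split follows the description above):

```python
# Under the proposal summarized above, memory allocated via
# Unsafe.allocateMemory() is native and bypasses -XX:MaxDirectMemorySize;
# only task off-heap memory and JVM overhead count against the limit.
pools_mb = {
    "task_off_heap": 64,      # direct ByteBuffers -> counted
    "jvm_overhead": 128,      # direct allocations by JVM / Netty -> counted
    "off_heap_managed": 512,  # Unsafe.allocateMemory -> NOT counted
    "network": 128,           # Unsafe.allocateMemory -> NOT counted
}
counted = {"task_off_heap", "jvm_overhead"}
max_direct_memory = sum(v for k, v in pools_mb.items() if k in counted)
print(max_direct_memory)  # 192
```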
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> > > > wrote:
> > > >
> > > > > Thanks for the clarification Xintong. I understand the two
> > alternatives
> > > > > now.
> > > > >
> > > > > I would be in favour of option 2 because it makes things explicit.
> If
> > > we
> > > > > don't limit the direct memory, I fear that we might end up in a
> > similar
> > > > > situation as we are currently in: The user might see that her
> process
> > > > gets
> > > > > killed by the OS and does not know why this is the case.
> > Consequently,
> > > > she
> > > > > tries to decrease the process memory size (similar to increasing
> the
> > > > cutoff
> > > > > ratio) in order to accommodate for the extra direct memory. Even
> > worse,
> > > > she
> > > > > tries to decrease memory budgets which are not fully used and hence
> > > won't
> > > > > change the overall memory consumption.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> [hidden email]
> > >
> > > > > wrote:
> > > > >
> > > > > > Let me explain this with a concrete example Till.
> > > > > >
> > > > > > Let's say we have the following scenario.
> > > > > >
> > > > > > Total Process Memory: 1GB
> > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> > Memory
> > > > and
> > > > > > Network Memory): 800MB
> > > > > >
> > > > > >
> > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > > > value,
> > > > > > let's say 1TB.
> > > > > >
> > > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > > Overhead does not exceed 200MB, then alternative 2 and alternative
> > > > > > 3 should have the same utility. Setting a larger
> > > > > > -XX:MaxDirectMemorySize will not reduce the sizes of the other
> > > > > > memory pools.
> > > > > >
> > > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > > Overhead potentially exceeds 200MB, then
> > > > > >
> > > > > >    - Alternative 2 suffers from frequent OOM. To avoid that, the
> > > > > >    only thing the user can do is to modify the configuration and
> > > > > >    increase JVM Direct Memory (Task Off-Heap Memory + JVM
> > > > > >    Overhead). Let's say that the user increases JVM Direct Memory
> > > > > >    to 250MB; this will reduce the total size of the other memory
> > > > > >    pools to 750MB, given that the total process memory remains 1GB.
> > > > > >    - For alternative 3, there is no chance of a direct OOM. There
> > > > > >    are chances of exceeding the total process memory limit, but
> > > > > >    given that the process may not use up all the reserved native
> > > > > >    memory (Off-Heap Managed Memory, Network Memory, JVM Metaspace),
> > > > > >    if the actual direct memory usage is slightly above yet very
> > > > > >    close to 200MB, the user probably does not need to change the
> > > > > >    configurations.
> > > > > >
> > > > > > Therefore, I think from the user's perspective, a feasible
> > > > > > configuration for alternative 2 may lead to lower resource
> > > > > > utilization compared to alternative 3.
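The budget arithmetic in the example above can be sketched as follows (a trivial Python illustration of the trade-off; it treats 1GB as 1000MB, matching the 200MB + 800MB split in the example):

```python
# Whatever the JVM Direct Memory budget (Task Off-Heap + JVM Overhead)
# grows by, the other pools (JVM heap, metaspace, off-heap managed,
# network) shrink by, since total process memory is fixed.
TOTAL_PROCESS_MEMORY_MB = 1000

def other_memory_mb(jvm_direct_memory_mb):
    return TOTAL_PROCESS_MEMORY_MB - jvm_direct_memory_mb

print(other_memory_mb(200))  # 800, the initial configuration
print(other_memory_mb(250))  # 750, after raising the direct budget
```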
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I guess you have to help me understand the difference between
> > > > > > > alternatives 2 and 3 w.r.t. memory under-utilization, Xintong.
> > > > > > >
> > > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap
> > Memory
> > > > and
> > > > > > JVM
> > > > > > > Overhead. Then there is the risk that this size is too low
> > > resulting
> > > > > in a
> > > > > > > lot of garbage collection and potentially an OOM.
> > > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger
> > > than
> > > > > > > alternative 2. This would of course reduce the sizes of the
> other
> > > > > memory
> > > > > > > types.
> > > > > > >
> > > > > > > How would alternative 2 now result in an under utilization of
> > > memory
> > > > > > > compared to alternative 3? If alternative 3 strictly sets a
> > higher
> > > > max
> > > > > > > direct memory size and we use only little, then I would expect
> > that
> > > > > > > alternative 3 results in memory under utilization.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <
> [hidden email]
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Xintong, Till,
> > > > > > > >
> > > > > > > >
> > > > > > > > > Native and Direct Memory
> > > > > > > >
> > > > > > > > My point is to set a very large max direct memory size when we
> > > > > > > > do not differentiate direct and native memory. If the direct
> > > > > > > > memory, including user direct memory and framework direct
> > > > > > > > memory, could be calculated correctly, then I am in favor of
> > > > > > > > setting the direct memory to a fixed value.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Memory Calculation
> > > > > > > >
> > > > > > > > I agree with Xintong. For Yarn and K8s, we need to check the
> > > > > > > > memory configurations on the client side, to avoid a job
> > > > > > > > submitting successfully and then failing in
> > > > > > > > the Flink master.
> > > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > >
> > > > > > > > Yang
> > > > > > > >
> > > > > > > > Xintong Song <[hidden email]> wrote on Tue, Aug 13, 2019, 22:07:
> > > > > > > >
> > > > > > > > > Thanks for replying, Till.
> > > > > > > > >
> > > > > > > > > About MemorySegment, I think you are right that we should
> not
> > > > > include
> > > > > > > > this
> > > > > > > > > issue in the scope of this FLIP. This FLIP should
> concentrate
> > > on
> > > > > how
> > > > > > to
> > > > > > > > > configure memory pools for TaskExecutors, with minimum
> > > > involvement
> > > > > on
> > > > > > > how
> > > > > > > > > memory consumers use it.
> > > > > > > > >
> > > > > > > > > About direct memory, I think alternative 3 may not have the
> > > > > > > > > same over-reservation issue that alternative 2 does, but at
> > > > > > > > > the cost of risking memory overuse at the container level,
> > > > > > > > > which is not good. My point is that both "Task Off-Heap
> > > > > > > > > Memory" and "JVM Overhead" are not easy to configure. For
> > > > > > > > > alternative 2, users might configure them higher than what is
> > > > > > > > > actually needed, just to avoid getting a direct OOM. For
> > > > > > > > > alternative 3, users do not get a direct OOM, so they may not
> > > > > > > > > configure the two options aggressively high. But the
> > > > > > > > > consequence is the risk of the overall container memory usage
> > > > > > > > > exceeding the budget.
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > > >
> > > > > > > > > > All in all I think it already looks quite good.
> Concerning
> > > the
> > > > > > first
> > > > > > > > open
> > > > > > > > > > question about allocating memory segments, I was
> wondering
> > > > > whether
> > > > > > > this
> > > > > > > > > is
> > > > > > > > > > strictly necessary to do in the context of this FLIP or
> > > whether
> > > > > > this
> > > > > > > > > could
> > > > > > > > > > be done as a follow up? Without knowing all details, I
> > would
> > > be
> > > > > > > > concerned
> > > > > > > > > > that we would widen the scope of this FLIP too much
> because
> > > we
> > > > > > would
> > > > > > > > have
> > > > > > > > > > to touch all the existing call sites of the MemoryManager
> > > where
> > > > > we
> > > > > > > > > allocate
> > > > > > > > > > memory segments (this should mainly be batch operators).
> > The
> > > > > > addition
> > > > > > > > of
> > > > > > > > > > the memory reservation call to the MemoryManager should
> not
> > > be
> > > > > > > affected
> > > > > > > > > by
> > > > > > > > > > this and I would hope that this is the only point of
> > > > interaction
> > > > > a
> > > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > > >
> > > > > > > > > > Concerning the second open question about setting or not
> > > > setting
> > > > > a
> > > > > > > max
> > > > > > > > > > direct memory limit, I would also be interested why Yang
> > Wang
> > > > > > thinks
> > > > > > > > > > leaving it open would be best. My concern about this
> would
> > be
> > > > > that
> > > > > > we
> > > > > > > > > would
> > > > > > > > > > be in a similar situation as we are now with the
> > > > > > RocksDBStateBackend.
> > > > > > > > If
> > > > > > > > > > the different memory pools are not clearly separated and
> > can
> > > > > spill
> > > > > > > over
> > > > > > > > > to
> > > > > > > > > > a different pool, then it is quite hard to understand
> what
> > > > > exactly
> > > > > > > > > causes a
> > > > > > > > > > process to get killed for using too much memory. This
> could
> > > > then
> > > > > > > easily
> > > > > > > > > > lead to a similar situation what we have with the
> > > cutoff-ratio.
> > > > > So
> > > > > > > why
> > > > > > > > > not
> > > > > > > > > > setting a sane default value for max direct memory and
> > giving
> > > > the
> > > > > > > user
> > > > > > > > an
> > > > > > > > > > option to increase it if he runs into an OOM.
> > > > > > > > > >
> > > > > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > > > > utilization
> > > > > > > than
> > > > > > > > > > alternative 3 where we set the direct memory to a higher
> > > value?
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Till
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > > [hidden email]
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > > >
> > > > > > > > > > > Regarding your comments:
> > > > > > > > > > >
> > > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > > I think setting a very large max direct memory size
> > > > definitely
> > > > > > has
> > > > > > > > some
> > > > > > > > > > > good sides. E.g., we do not worry about direct OOM, and
> > we
> > > > > don't
> > > > > > > even
> > > > > > > > > > need
> > > > > > > > > > > to allocate managed / network memory with
> > > Unsafe.allocate() .
> > > > > > > > > > > However, there are also some down sides of doing this.
> > > > > > > > > > >
> > > > > > > > > > >    - One thing I can think of is that if a task executor
> > > > > > > > > > >    container is killed due to overusing memory, it could
> > > > > > > > > > >    be hard for us to know which part of the memory is
> > > > > > > > > > >    overused.
> > > > > > > > > > >    - Another downside is that the JVM never triggers GC
> > > > > > > > > > >    due to reaching the max direct memory limit, because
> > > > > > > > > > >    the limit is too high to be reached. That means we
> > > > > > > > > > >    kind of rely on heap memory to trigger GC and release
> > > > > > > > > > >    direct memory. That could be a problem in cases where
> > > > > > > > > > >    we have more direct memory usage but not enough heap
> > > > > > > > > > >    activity to trigger the GC.
> > > > > > > > > > >
> > > > > > > > > > > Maybe you can share your reasons for preferring to set a
> > > > > > > > > > > very large value, in case there is anything else I
> > > > > > > > > > > overlooked.
> > > > > > > > > > >
> > > > > > > > > > > *Memory Calculation*
> > > > > > > > > > > If there is any conflict between multiple configurations
> > > > > > > > > > > that the user explicitly specified, I think we should
> > > > > > > > > > > throw an error.
> > > > > > > > > > > I think doing the checking on the client side is a good
> > > > > > > > > > > idea, so that on Yarn / K8s we can discover the problem
> > > > > > > > > > > before submitting the Flink cluster, which is always a
> > > > > > > > > > > good thing.
> > > > > > > > > > > But we cannot rely only on the client-side checking,
> > > > > > > > > > > because for a standalone cluster, TaskManagers on
> > > > > > > > > > > different machines may have different configurations, and
> > > > > > > > > > > the client does not see them.
> > > > > > > > > > > What do you think?
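The conflict check discussed above could be sketched as follows (a minimal Python illustration of failing fast when explicitly configured pools exceed the total process memory; the pool names are hypothetical, not the actual Flink configuration keys):

```python
# Sanity-check explicitly configured fine-grained pools against the total
# process memory and fail fast, before the cluster is even submitted.
def validate_memory_config(total_process_mb, pools_mb):
    configured = sum(pools_mb.values())
    if configured > total_process_mb:
        raise ValueError(
            f"Configured pools ({configured} MB) exceed total process "
            f"memory ({total_process_mb} MB): {sorted(pools_mb)}")
    # Memory left over for the pools derived from the remainder.
    return total_process_mb - configured

leftover = validate_memory_config(
    1000, {"network": 128, "managed": 512, "task_off_heap": 64})
print(leftover)  # 296
```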
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi xintong,
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for your detailed proposal. After all the
> memory
> > > > > > > > configuration
> > > > > > > > > > are
> > > > > > > > > > > > introduced, it will be more powerful to control the
> > flink
> > > > > > memory
> > > > > > > > > > usage. I
> > > > > > > > > > > > just have few questions about it.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > > >
> > > > > > > > > > > > We do not differentiate user direct memory and native
> > > > > > > > > > > > memory. They are all included in task off-heap memory,
> > > > > > > > > > > > right? So I don't think we could set
> > > > > > > > > > > > -XX:MaxDirectMemorySize properly. I prefer leaving it
> > > > > > > > > > > > at a very large
> > > > > > > > > > > > value.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > > >
> > > > > > > > > > > > If the sum of the fine-grained memory (network memory,
> > > > > > > > > > > > managed memory, etc.) is larger than the total process
> > > > > > > > > > > > memory, how do we deal with this situation? Do we need
> > > > > > > > > > > > to check the memory configuration on the client?
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song <[hidden email]> wrote on Wed, Aug 7, 2019, 22:14:
> > >
> > >
> > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for sharing your opinion Till.
> > > >
> > > > I'm also in favor of alternative 2. I was wondering whether we can
> > avoid
> > > > using Unsafe.allocate() for off-heap managed memory and network
> memory
> > > with
> > > > alternative 3. But after giving it a second thought, I think even for
> > > > alternative 3 using direct memory for off-heap managed memory could
> > cause
> > > > problems.
> > > >
> > > > Hi Yang,
> > > >
> > > > Regarding your concern, I think what proposed in this FLIP it to have
> > > both
> > > > off-heap managed memory and network memory allocated through
> > > > Unsafe.allocate(), which means they are practically native memory and
> > not
> > > > limited by JVM max direct memory. The only parts of memory limited by
> > JVM
> > > > max direct memory are task off-heap memory and JVM overhead, which
> are
> > > > exactly alternative 2 suggests to set the JVM max direct memory to.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <[hidden email]>
> > > > wrote:
> > > >
> > > > > Thanks for the clarification Xintong. I understand the two
> > alternatives
> > > > > now.
> > > > >
> > > > > I would be in favour of option 2 because it makes things explicit.
> If
> > > we
> > > > > don't limit the direct memory, I fear that we might end up in a
> > similar
> > > > > situation as we are currently in: The user might see that her
> process
> > > > gets
> > > > > killed by the OS and does not know why this is the case.
> > Consequently,
> > > > she
> > > > > tries to decrease the process memory size (similar to increasing
> the
> > > > cutoff
> > > > > ratio) in order to accommodate for the extra direct memory. Even
> > worse,
> > > > she
> > > > > tries to decrease memory budgets which are not fully used and hence
> > > won't
> > > > > change the overall memory consumption.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> [hidden email]
> > >
> > > > > wrote:
> > > > >
> > > > > > Let me explain this with a concrete example Till.
> > > > > >
> > > > > > Let's say we have the following scenario.
> > > > > >
> > > > > > Total Process Memory: 1GB
> > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> > Memory
> > > > and
> > > > > > Network Memory): 800MB
> > > > > >
> > > > > >
> > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > > > value,
> > > > > > let's say 1TB.
> > > > > >
> > > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > Overhead
> > > > > > do not exceed 200MB, then alternative 2 and alternative 3 should
> > have
> > > > the
> > > > > > same utility. Setting larger -XX:MaxDirectMemorySize will not
> > reduce
> > > > the
> > > > > > sizes of the other memory pools.
> > > > > >
> > > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > > Overhead potentially exceed 200MB, then
> > > > > >
> > > > > >    - Alternative 2 suffers from frequent OOM. To avoid that, the
> > only
> > > > > thing
> > > > > >    user can do is to modify the configuration and increase JVM
> > Direct
> > > > > > Memory
> > > > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user
> > > increases
> > > > > JVM
> > > > > >    Direct Memory to 250MB, this will reduce the total size of
> other
> > > > > memory
> > > > > >    pools to 750MB, given the total process memory remains 1GB.
> > > > > >    - For alternative 3, there is no chance of direct OOM. There
> are
> > > > > chances
> > > > > >    of exceeding the total process memory limit, but given that
> the
> > > > > process
> > > > > > may
> > > > > >    not use up all the reserved native memory (Off-Heap Managed
> > > Memory,
> > > > > > Network
> > > > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > > > slightly
> > > > > > above
> > > > > >    yet very close to 200MB, users probably do not need to change
> the
> > > > > >    configurations.
> > > > > >
> > > > > > Therefore, I think from the user's perspective, a feasible
> > > > configuration
> > > > > > for alternative 2 may lead to lower resource utilization compared
> > to
> > > > > > alternative 3.
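The arithmetic in the example above can be written as a tiny helper. This is purely illustrative (the 1 GB / 200 MB / 250 MB figures come from the example, not from Flink's actual configuration code):

```java
// Illustrative only: derive the size of the "other" pools (heap, metaspace,
// off-heap managed memory, network memory) from the total process memory and
// the part counted against -XX:MaxDirectMemorySize. Numbers from the example.
public class MemoryBudget {

    public static long otherPoolsMb(long totalProcessMb, long jvmDirectMb) {
        // Everything not counted against the JVM direct memory limit.
        return totalProcessMb - jvmDirectMb;
    }

    public static void main(String[] args) {
        System.out.println(otherPoolsMb(1000, 200)); // prints 800
        // Raising JVM Direct Memory to 250 MB shrinks the other pools:
        System.out.println(otherPoolsMb(1000, 250)); // prints 750
    }
}
```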
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I guess you have to help me understand the difference between
> > > > > > alternative 2
> > > > > > > and 3 wrt to memory under utilization Xintong.
> > > > > > >
> > > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap
> > Memory
> > > > and
> > > > > > JVM
> > > > > > > Overhead. Then there is the risk that this size is too low
> > > resulting
> > > > > in a
> > > > > > > lot of garbage collection and potentially an OOM.
> > > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger
> > > than
> > > > > > > alternative 2. This would of course reduce the sizes of the
> other
> > > > > memory
> > > > > > > types.
> > > > > > >
> > > > > > > How would alternative 2 now result in an under utilization of
> > > memory
> > > > > > > compared to alternative 3? If alternative 3 strictly sets a
> > higher
> > > > max
> > > > > > > direct memory size and we use only little, then I would expect
> > that
> > > > > > > alternative 3 results in memory under utilization.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <
> [hidden email]
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Xintong, Till,
> > > > > > > >
> > > > > > > >
> > > > > > > > > Native and Direct Memory
> > > > > > > >
> > > > > > > > My point is setting a very large max direct memory size when
> we
> > > do
> > > > > not
> > > > > > > > differentiate direct and native memory. If the direct
> > > > > memory,including
> > > > > > > user
> > > > > > > > direct memory and framework direct memory,could be calculated
> > > > > > > > correctly,then
> > > > > > > > I am in favor of setting direct memory to a fixed value.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Memory Calculation
> > > > > > > >
> > > > > > > > I agree with xintong. For Yarn and k8s,we need to check the
> > > memory
> > > > > > > > configurations in client to avoid submitting successfully and
> > > > failing
> > > > > > in
> > > > > > > > the flink master.
> > > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > >
> > > > > > > > Yang
> > > > > > > >
> > > > > > > > Xintong Song <[hidden email]>于2019年8月13日 周二22:07写道:
> > > > > > > >
> > > > > > > > > Thanks for replying, Till.
> > > > > > > > >
> > > > > > > > > About MemorySegment, I think you are right that we should
> not
> > > > > include
> > > > > > > > this
> > > > > > > > > issue in the scope of this FLIP. This FLIP should
> concentrate
> > > on
> > > > > how
> > > > > > to
> > > > > > > > > configure memory pools for TaskExecutors, with minimum
> > > > involvement
> > > > > on
> > > > > > > how
> > > > > > > > > memory consumers use it.
> > > > > > > > >
> > > > > > > > > About direct memory, I think alternative 3 may not have
> the
> > > > same
> > > > > > over
> > > > > > > > > reservation issue that alternative 2 does, but at the cost
> of
> > > > risk
> > > > > of
> > > > > > > > over
> > > > > > > > > using memory at the container level, which is not good. My
> > > point
> > > > is
> > > > > > > that
> > > > > > > > > both "Task Off-Heap Memory" and "JVM Overhead" are not easy
> > to
> > > > > > config.
> > > > > > > > For
> > > > > > > > > alternative 2, users might configure them higher than what
> > > > actually
> > > > > > > > needed,
> > > > > > > > > just to avoid getting a direct OOM. For alternative 3,
> users
> > do
> > > > not
> > > > > > get
> > > > > > > > > direct OOM, so they may not config the two options
> > aggressively
> > > > > high.
> > > > > > > But
> > > > > > > > > the consequences are risks of overall container memory
> usage
> > > > > exceeds
> > > > > > > the
> > > > > > > > > budget.
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > > >
> > > > > > > > > > All in all I think it already looks quite good.
> Concerning
> > > the
> > > > > > first
> > > > > > > > open
> > > > > > > > > > question about allocating memory segments, I was
> wondering
> > > > > whether
> > > > > > > this
> > > > > > > > > is
> > > > > > > > > > strictly necessary to do in the context of this FLIP or
> > > whether
> > > > > > this
> > > > > > > > > could
> > > > > > > > > > be done as a follow up? Without knowing all details, I
> > would
> > > be
> > > > > > > > concerned
> > > > > > > > > > that we would widen the scope of this FLIP too much
> because
> > > we
> > > > > > would
> > > > > > > > have
> > > > > > > > > > to touch all the existing call sites of the MemoryManager
> > > where
> > > > > we
> > > > > > > > > allocate
> > > > > > > > > > memory segments (this should mainly be batch operators).
> > The
> > > > > > addition
> > > > > > > > of
> > > > > > > > > > the memory reservation call to the MemoryManager should
> not
> > > be
> > > > > > > affected
> > > > > > > > > by
> > > > > > > > > > this and I would hope that this is the only point of
> > > > interaction
> > > > > a
> > > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > > >
> > > > > > > > > > Concerning the second open question about setting or not
> > > > setting
> > > > > a
> > > > > > > max
> > > > > > > > > > direct memory limit, I would also be interested why Yang
> > Wang
> > > > > > thinks
> > > > > > > > > > leaving it open would be best. My concern about this
> would
> > be
> > > > > that
> > > > > > we
> > > > > > > > > would
> > > > > > > > > > be in a similar situation as we are now with the
> > > > > > RocksDBStateBackend.
> > > > > > > > If
> > > > > > > > > > the different memory pools are not clearly separated and
> > can
> > > > > spill
> > > > > > > over
> > > > > > > > > to
> > > > > > > > > > a different pool, then it is quite hard to understand
> what
> > > > > exactly
> > > > > > > > > causes a
> > > > > > > > > > process to get killed for using too much memory. This
> could
> > > > then
> > > > > > > easily
> > > > > > > > > > lead to a similar situation what we have with the
> > > cutoff-ratio.
> > > > > So
> > > > > > > why
> > > > > > > > > not
> > > > > > > > > > setting a sane default value for max direct memory and
> > giving
> > > > the
> > > > > > > user
> > > > > > > > an
> > > > > > > > > > option to increase it if he runs into an OOM.
> > > > > > > > > >
> > > > > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > > > > utilization
> > > > > > > than
> > > > > > > > > > alternative 3 where we set the direct memory to a higher
> > > value?
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Till
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > > [hidden email]
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > > >
> > > > > > > > > > > Regarding your comments:
> > > > > > > > > > >
> > > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > > I think setting a very large max direct memory size
> > > > definitely
> > > > > > has
> > > > > > > > some
> > > > > > > > > > > good sides. E.g., we do not worry about direct OOM, and
> > we
> > > > > don't
> > > > > > > even
> > > > > > > > > > need
> > > > > > > > > > > to allocate managed / network memory with
> > > Unsafe.allocate().
> > > > > > > > > > > However, there are also some down sides of doing this.
> > > > > > > > > > >
> > > > > > > > > > >    - One thing I can think of is that if a task
> executor
> > > > > > container
> > > > > > > is
> > > > > > > > > > >    killed due to overusing memory, it could be hard for
> > us
> > > > to
> > > > > > know
> > > > > > > > > which
> > > > > > > > > > > part
> > > > > > > > > > >    of the memory is overused.
> > > > > > > > > > >    - Another down side is that the JVM never trigger GC
> > due
> > > > to
> > > > > > > > reaching
> > > > > > > > > > max
> > > > > > > > > > >    direct memory limit, because the limit is too high
> to
> > be
> > > > > > > reached.
> > > > > > > > > That
> > > > >    means we kind of rely on heap memory to trigger GC
> > and
> > > > > > release
> > > > > > > > > direct
> > > > > > > > > > >    memory. That could be a problem in cases where we
> have
> > > > more
> > > > > > > direct
> > > > > > > > > > > memory
> > > > > > > > > > >    usage but not enough heap activity to trigger the
> GC.
> > > > > > > > > > >
> > > > > > > > > > > Maybe you can share your reasons for preferring
> setting a
> > > > very
> > > > > > > large
> > > > > > > > > > value,
> > > > > > > > > > > if there are anything else I overlooked.
> > > > > > > > > > >
> > > > > > > > > > > *Memory Calculation*
> > > > > > > > > > > If there is any conflict between multiple configuration
> > > that
> > > > > user
> > > > > > > > > > > explicitly specified, I think we should throw an error.
> > > > > > > > > > > I think doing checking on the client side is a good
> idea,
> > > so
> > > > > that
> > > > > > > on
> > > > > > > > > > Yarn /
> > > > > > > > > > > K8s we can discover the problem before submitting the
> > Flink
> > > > > > > cluster,
> > > > > > > > > > which
> > > > > > > > > > > is always a good thing.
> > > > > > > > > > > But we can not only rely on the client side checking,
> > > because
> > > > > for
> > > > > > > > > > > standalone cluster TaskManagers on different machines
> may
> > > > have
> > > > > > > > > different
> > > > > > > > > > > configurations and the client does not see that.
> > > > > > > > > > > What do you think?
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi xintong,
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for your detailed proposal. After all the
> memory
> > > > > > > > configurations
> > > > > > > > > > are
> > > > > > > > > > > > introduced, it will be more powerful to control the
> > flink
> > > > > > memory
> > > > > > > > > > usage. I
> > > > > > > > > > > > just have few questions about it.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > > >
> > > > > > > > > > > > We do not differentiate user direct memory and native
> > > > memory.
> > > > > > > They
> > > > > > > > > are
> > > > > > > > > > > all
> > > > > > > > > > > > included in task off-heap memory. Right? So I don’t think
> > > > > > > > > > > > we could set the -XX:MaxDirectMemorySize properly. I prefer
> leaving
> > > it a
> > > > > > very
> > > > > > > > > large
> > > > > > > > > > > > value.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > > >
> > > > > > > > > > > > If the sum of fine-grained memory (network memory,
> > > > managed
> > > > > > > > memory,
> > > > > > > > > > > etc.)
> > > > > > > > > > > > is larger than total process memory, how do we deal
> > with
> > > > this
> > > > > > > > > > situation?
> > > > > > > > > > > Do
> > > > > > > > > > > > we need to check the memory configuration in client?
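A hypothetical client-side sanity check along the lines discussed here could look as follows. The pool names and the rule (the fine-grained pools must fit into the total process memory) are illustrative assumptions, not Flink's real validation code:

```java
// Hypothetical client-side check: fail fast before submission if the
// explicitly configured fine-grained pools cannot fit into the total
// process memory. Names and the exact rule are assumptions for illustration.
public class MemoryConfigCheck {

    public static void check(long totalProcessMb, long networkMb,
                             long managedMb, long metaspaceMb) {
        long fineGrained = networkMb + managedMb + metaspaceMb;
        if (fineGrained > totalProcessMb) {
            throw new IllegalArgumentException(
                "Sum of fine-grained pools (" + fineGrained
                    + " MB) exceeds total process memory ("
                    + totalProcessMb + " MB)");
        }
    }

    public static void main(String[] args) {
        check(1000, 100, 300, 100); // fits: no exception
    }
}
```

Such a check can only cover the configuration the client sees; standalone TaskManagers with divergent local configurations would still need the same validation on startup.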
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song <[hidden email]> 于2019年8月7日周三
> > > > 下午10:14写道:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > >
> > > > > > > > > > > > > We would like to start a discussion thread on
> > "FLIP-49:
> > > > > > Unified
> > > > > > > > > > Memory
> > > > > > > > > > > > > Configuration for TaskExecutors"[1], where we
> > describe
> > > > how
> > > > > to
> > > > > > > > > improve
> > > > > > > > > > > > > TaskExecutor memory configurations. The FLIP
> document
> > > is
> > > > > > mostly
> > > > > > > > > based
> > > > > > > > > > > on
> > > > > > > > > > > > an
> > > > > > > > > > > > > early design "Memory Management and Configuration
> > > > > > Reloaded"[2]
> > > > > > > by
> > > > > > > > > > > > Stephan,
> > > > > > > > > > > > > with updates from follow-up discussions both online
> > and
> > > > > > > offline.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This FLIP addresses several shortcomings of current
> > > > (Flink
> > > > > > 1.9)
> > > > > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Different configuration for Streaming and
> Batch.
> > > > > > > > > > > > >    - Complex and difficult configuration of RocksDB
> > in
> > > > > > > Streaming.
> > > > > > > > > > > > >    - Complicated, uncertain and hard to understand.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Key changes to solve the problems can be summarized
> > as
> > > > > > follows.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Extend memory manager to also account for
> memory
> > > > usage
> > > > > > by
> > > > > > > > > state
> > > > > > > > > > > > >    backends.
> > > > > > > > > > > > >    - Modify how TaskExecutor memory is partitioned
> > > > > into
> > > > > > > > > > individual
> > > > > > > > > > > > >    memory reservations and pools.
> > > > > > > > > > > > >    - Simplify memory configuration options and
> > > > calculations
> > > > > > > > logics.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please find more details in the FLIP wiki document
> > [1].
> > > > > > > > > > > > >
> > > > > > > > > > > > > (Please note that the early design doc [2] is out
> of
> > > > sync,
> > > > > > and
> > > > > > > it
> > > > > > > > > is
> > > > > > > > > > > > > appreciated to have the discussion in this mailing
> > list
> > > > > > > thread.)
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Looking forward to your feedbacks.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > > > > >
> > > > > > > > > > > > > [2]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Stephan Ewen
@Xintong: Concerning "wait for memory users before task dispose and memory
release": I agree, that's how it should be. Let's try it out.

@Xintong @Jingsong: Concerning " JVM does not wait for GC when allocating
direct memory buffer": There seems to be pretty elaborate logic to free
buffers when allocating new ones. See
http://hg.openjdk.java.net/jdk8u/jdk8u-dev/jdk/file/tip/src/share/classes/java/nio/Bits.java#l643
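A heavily simplified sketch of that bookkeeping: before each direct allocation the JDK tries to reserve against the limit, and on failure it nudges the cleaner/GC machinery and retries before throwing OutOfMemoryError. The class below is illustrative only, not the real `Bits` code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of direct-memory accounting in the spirit of
// java.nio.Bits.reserveMemory. The real JDK additionally helps the
// reference handler and calls System.gc() with retries before failing.
public class DirectReservation {

    static final long MAX_DIRECT = 200L * 1024 * 1024; // stand-in for -XX:MaxDirectMemorySize
    static final AtomicLong reserved = new AtomicLong();

    /** Returns true if the reservation fits under the limit. */
    public static boolean tryReserve(long bytes) {
        while (true) {
            long cur = reserved.get();
            if (cur + bytes > MAX_DIRECT) {
                return false; // real JDK: trigger GC, sleep, retry, then OOM
            }
            if (reserved.compareAndSet(cur, cur + bytes)) {
                return true;
            }
        }
    }

    /** Called when a cleaner frees a direct buffer. */
    public static void unreserve(long bytes) {
        reserved.addAndGet(-bytes);
    }

    public static void main(String[] args) {
        System.out.println(tryReserve(1024)); // prints true
    }
}
```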

@Till: Maybe. If we assume that the JVM default works (like going with
option 2 and not setting "-XX:MaxDirectMemorySize" at all), then I think it
should be okay to set "-XX:MaxDirectMemorySize" to
"off_heap_managed_memory + direct_memory" even if we use RocksDB. That is a
big if, though, I honestly have no idea :D Would be good to understand
this, though, because this would affect option (2) and option (1.2).
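For concreteness, the sizing rule above can be written down as below; the pool sizes used are made-up assumptions, not recommended defaults:

```java
// Sketch of the rule above: -XX:MaxDirectMemorySize covering both the
// off-heap managed memory and the direct memory budget. Values are
// illustrative assumptions only.
public class DirectLimit {

    public static String maxDirectFlag(long offHeapManagedMb, long directMb) {
        return "-XX:MaxDirectMemorySize=" + (offHeapManagedMb + directMb) + "m";
    }

    public static void main(String[] args) {
        System.out.println(maxDirectFlag(300, 64)); // prints -XX:MaxDirectMemorySize=364m
    }
}
```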

On Mon, Aug 19, 2019 at 4:44 PM Xintong Song <[hidden email]> wrote:

> Thanks for the inputs, Jingsong.
>
> Let me try to summarize your points. Please correct me if I'm wrong.
>
>    - Memory consumers should always avoid returning memory segments to
>    memory manager while there are still un-cleaned structures / threads
> that
>    may use the memory. Otherwise, it would cause serious problems by having
>    multiple consumers trying to use the same memory segment.
>    - JVM does not wait for GC when allocating direct memory buffer.
>    Therefore, even if we set a proper max direct memory size limit, we may
>    still encounter a direct memory OOM if GC cleans memory more slowly than
>    direct memory is allocated.
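The first point — never return a segment to the memory manager while some structure or thread may still write to it — can at least be made fail-fast. A minimal illustrative sketch (not Flink's actual MemorySegment):

```java
// Illustrative fail-fast segment (not Flink's MemorySegment): once released,
// any further access throws instead of silently writing into memory that may
// already have been handed to another consumer.
public class GuardedSegment {

    private byte[] buffer; // stand-in for the underlying memory

    public GuardedSegment(int size) {
        this.buffer = new byte[size];
    }

    public void put(int index, byte value) {
        byte[] buf = this.buffer;
        if (buf == null) {
            throw new IllegalStateException("segment already released");
        }
        buf[index] = value;
    }

    /** Must only be called once all consumers (including async threads) are done. */
    public void release() {
        this.buffer = null;
    }

    public static void main(String[] args) {
        GuardedSegment s = new GuardedSegment(16);
        s.put(0, (byte) 42);
        s.release();
    }
}
```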
>
> Am I understanding this correctly?
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Aug 19, 2019 at 4:21 PM JingsongLee <[hidden email]
> .invalid>
> wrote:
>
> > Hi stephan:
> >
> > About option 2:
> >
> > If additional threads are not cleanly shut down before we exit the task:
> > with the current memory reuse, the exiting task has already freed the
> > memory it used. If that memory is handed to other tasks while
> > asynchronous threads of the exited task are still writing to it, there
> > will be concurrency safety problems, and even errors in user computation
> > results.
> >
> > So I think this is a serious and intolerable bug; no matter which
> > option we choose, it should be avoided.
> >
> > About direct memory cleaned by GC:
> > I don't think it is a good idea, I've encountered so many situations
> >  that it's too late for GC to cause DirectMemory OOM. Release and
> >  allocate DirectMemory depend on the type of user job, which is
> >  often beyond our control.
> >
> > Best,
> > Jingsong Lee
> >
> >
> > ------------------------------------------------------------------
> > From:Stephan Ewen <[hidden email]>
> > Send Time:2019年8月19日(星期一) 15:56
> > To:dev <[hidden email]>
> > Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> > TaskExecutors
> >
> > My main concern with option 2 (manually release memory) is that segfaults
> > in the JVM send off all sorts of alarms on user ends. So we need to
> > guarantee that this never happens.
> >
> > The trickiness is in tasks that use data structures / algorithms with
> > additional threads, like hash table spill/read and sorting threads. We
> need
> > to ensure that these cleanly shut down before we can exit the task.
> > I am not sure that we have that guaranteed already, that's why option 1.1
> > seemed simpler to me.
> >
> > On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <[hidden email]>
> > wrote:
> >
> > > Thanks for the comments, Stephan. Summarized in this way really makes
> > > things easier to understand.
> > >
> > > I'm in favor of option 2, at least for the moment. I think it is not
> that
> > > difficult to keep it segfault safe for memory manager, as long as we
> > always
> > > de-allocate the memory segment when it is released from the memory
> > > consumers. A segfault could only happen if a memory consumer continues
> > > using the buffer of a memory segment after releasing it, in which case
> > > we do want the job to fail so we detect the memory leak early.
> > >
> > > For option 1.2, I don't think this is a good idea. Not only because the
> > > assumption (regular GC is enough to clean direct buffers) may not
> always
> > be
> > > > true, but also because it makes it harder to find problems in cases of memory
> > > overuse. E.g., user configured some direct memory for the user
> libraries.
> > > > If the library actually uses more direct memory than configured, which
> > > cannot be cleaned by GC because they are still in use, may lead to
> > overuse
> > > of the total container memory. In that case, if it didn't touch the JVM
> > > default max direct memory limit, we cannot get a direct memory OOM and
> it
> > > will become super hard to understand which part of the configuration
> needs
> > > to be updated.
> > >
> > > For option 1.1, it has the similar problem as 1.2, if the exceeded
> direct
> > > memory does not reach the max direct memory limit specified by the
> > > dedicated parameter. I think it is slightly better than 1.2, only
> because
> > > we can tune the parameter.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <[hidden email]> wrote:
> > >
> > > > About the "-XX:MaxDirectMemorySize" discussion, maybe let me
> summarize
> > > it a
> > > > bit differently:
> > > >
> > > > We have the following two options:
> > > >
> > > > (1) We let MemorySegments be de-allocated by the GC. That makes it
> > > segfault
> > > > safe. But then we need a way to trigger GC in case de-allocation and
> > > > re-allocation of a bunch of segments happens quickly, which is often
> > the
> > > > case during batch scheduling or task restart.
> > > >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> > > >   - Another way could be to have a dedicated bookkeeping in the
> > > > MemoryManager (option 1.2), so that this is a number independent of
> the
> > > > "-XX:MaxDirectMemorySize" parameter.
> > > >
> > > > (2) We manually allocate and de-allocate the memory for the
> > > MemorySegments
> > > > (option 2). That way we need not worry about triggering GC by some
> > > > threshold or bookkeeping, but it is harder to prevent segfaults. We
> > need
> > > to
> > > > be very careful about when we release the memory segments (only in
> the
> > > > cleanup phase of the main thread).
> > > >
> > > > If we go with option 1.1, we probably need to set
> > > > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> direct_memory"
> > > and
> > > > have "direct_memory" as a separate reserved memory pool. Because if
> we
> > > just
> > > > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> > > jvm_overhead",
> > > > then there will be times when that entire memory is allocated by
> direct
> > > > buffers and we have nothing left for the JVM overhead. So we either
> > need
> > > a
> > > > way to compensate for that (again some safety margin cutoff value) or
> > we
> > > > will exceed container memory.
> > > >
> > > > If we go with option 1.2, we need to be aware that it takes elaborate
> > > logic
> > > > to push recycling of direct buffers without always triggering a full
> > GC.
> > > >
> > > >
> > > > My first guess is that the options will be easiest to do in the
> > following
> > > > order:
> > > >
> > > >   - Option 1.1 with a dedicated direct_memory parameter, as discussed
> > > > above. We would need to find a way to set the direct_memory parameter
> > by
> > > > default. We could start with 64 MB and see how it goes in practice.
> One
> > > > danger I see is that setting this too low can cause a bunch of
> > additional
> > > > GCs compared to before (we need to watch this carefully).
> > > >
> > > >   - Option 2. It is actually quite simple to implement, we could try
> > how
> > > > segfault safe we are at the moment.
> > > >
> > > >   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> > > parameter
> > > > at all and assume that all the direct memory allocations that the JVM
> > and
> > > > Netty do are infrequent enough to be cleaned up fast enough through
> > > regular
> > > > GC. I am not sure if that is a valid assumption, though.
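Option 2 in code, as a minimal sketch: allocate and free native memory manually via `sun.misc.Unsafe` (such memory is not counted against `-XX:MaxDirectMemorySize`). A use-after-free on the raw address is exactly the segfault risk discussed above; this is a toy round-trip, not Flink's MemorySegment implementation.

```java
import java.lang.reflect.Field;

// Toy sketch of manual native-memory management (option 2). Not Flink code;
// a read or write on the address after freeMemory() can segfault the JVM.
public class ManualSegment {

    private static sun.misc.Unsafe unsafe() {
        try {
            Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (sun.misc.Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    /** Writes a long into freshly allocated native memory, reads it back, frees it. */
    public static long roundTrip(long value) {
        sun.misc.Unsafe u = unsafe();
        long address = u.allocateMemory(8); // native, outside -XX:MaxDirectMemorySize
        try {
            u.putLong(address, value);
            return u.getLong(address);
        } finally {
            u.freeMemory(address); // manual release: only safe once all users are done
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42L)); // prints 42
    }
}
```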
> > > >
> > > > Best,
> > > > Stephan
> > > >
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> > > > wrote:
> > > >
> > > > > Thanks for sharing your opinion Till.
> > > > >
> > > > > I'm also in favor of alternative 2. I was wondering whether we can
> > > avoid
> > > > > using Unsafe.allocate() for off-heap managed memory and network
> > memory
> > > > with
> > > > > alternative 3. But after giving it a second thought, I think even
> for
> > > > > alternative 3 using direct memory for off-heap managed memory could
> > > cause
> > > > > problems.
> > > > >
> > > > > Hi Yang,
> > > > >
> > > > > Regarding your concern, I think what is proposed in this FLIP is to
> have
> > > > both
> > > > > off-heap managed memory and network memory allocated through
> > > > > Unsafe.allocate(), which means they are practically native memory
> and
> > > not
> > > > > limited by JVM max direct memory. The only parts of memory limited
> by
> > > JVM
> > > > > max direct memory are task off-heap memory and JVM overhead, which
> > are
> > > > > exactly alternative 2 suggests to set the JVM max direct memory to.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <
> [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Thanks for the clarification Xintong. I understand the two
> > > alternatives
> > > > > > now.
> > > > > >
> > > > > > I would be in favour of option 2 because it makes things
> explicit.
> > If
> > > > we
> > > > > > don't limit the direct memory, I fear that we might end up in a
> > > similar
> > > > > > situation as we are currently in: The user might see that her
> > process
> > > > > gets
> > > > > > killed by the OS and does not know why this is the case.
> > > Consequently,
> > > > > she
> > > > > > tries to decrease the process memory size (similar to increasing
> > the
> > > > > cutoff
> > > > > > ratio) in order to accommodate for the extra direct memory. Even
> > > worse,
> > > > > she
> > > > > > tries to decrease memory budgets which are not fully used and
> hence
> > > > won't
> > > > > > change the overall memory consumption.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Let me explain this with a concrete example Till.
> > > > > > >
> > > > > > > Let's say we have the following scenario.
> > > > > > >
> > > > > > > Total Process Memory: 1GB
> > > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> > > Memory
> > > > > and
> > > > > > > Network Memory): 800MB
> > > > > > >
> > > > > > >
> > > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very
> large
> > > > > value,
> > > > > > > let's say 1TB.
> > > > > > >
> > > > > > > If the actual direct memory usage of Task Off-Heap Memory and
> JVM
> > > > > > Overhead
> > > > > > > do not exceed 200MB, then alternative 2 and alternative 3
> should
> > > have
> > > > > the
> > > > > > > same utility. Setting larger -XX:MaxDirectMemorySize will not
> > > reduce
> > > > > the
> > > > > > > sizes of the other memory pools.
> > > > > > >
> > > > > > > If the actual direct memory usage of Task Off-Heap Memory and
> JVM
> > > > > > > Overhead potentially exceed 200MB, then
> > > > > > >
> > > > > > >    - Alternative 2 suffers from frequent OOM. To avoid that,
> the
> > > only
> > > > > > thing
> > > > > > >    user can do is to modify the configuration and increase JVM
> > > Direct
> > > > > > > Memory
> > > > > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user
> > > > increases
> > > > > > JVM
> > > > > > >    Direct Memory to 250MB, this will reduce the total size of
> > other
> > > > > > memory
> > > > > > >    pools to 750MB, given the total process memory remains 1GB.
> > > > > > >    - For alternative 3, there is no chance of direct OOM. There
> > are
> > > > > > chances
> > > > > > >    of exceeding the total process memory limit, but given that
> > the
> > > > > > process
> > > > > > > may
> > > > > > >    not use up all the reserved native memory (Off-Heap Managed
> > > > Memory,
> > > > > > > Network
> > > > > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > > > > slightly
> > > > > > > above
> > > > > > >    yet very close to 200MB, users probably do not need to change
> > the
> > > > > > >    configurations.
> > > > > > >
> > > > > > > Therefore, I think from the user's perspective, a feasible
> > > > > configuration
> > > > > > > for alternative 2 may lead to lower resource utilization
> compared
> > > to
> > > > > > > alternative 3.
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> > > [hidden email]
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I guess you have to help me understand the difference between
> > > > > > > alternative 2
> > > > > > > > and 3 wrt to memory under utilization Xintong.
> > > > > > > >
> > > > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap
> > > Memory
> > > > > and
> > > > > > > JVM
> > > > > > > > Overhead. Then there is the risk that this size is too low
> > > > resulting
> > > > > > in a
> > > > > > > > lot of garbage collection and potentially an OOM.
> > > > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something
> larger
> > > > than
> > > > > > > > alternative 2. This would of course reduce the sizes of the
> > other
> > > > > > memory
> > > > > > > > types.
> > > > > > > >
> > > > > > > > How would alternative 2 now result in an under utilization of
> > > > memory
> > > > > > > > compared to alternative 3? If alternative 3 strictly sets a
> > > higher
> > > > > max
> > > > > > > > direct memory size and we use only little, then I would
> expect
> > > that
> > > > > > > > alternative 3 results in memory under utilization.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Xintong, Till,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Native and Direct Memory
> > > > > > > > >
> > > > > > > > > My point is setting a very large max direct memory size
> when
> > we
> > > > do
> > > > > > not
> > > > > > > > > differentiate direct and native memory. If the direct
> > > > > > memory,including
> > > > > > > > user
> > > > > > > > > direct memory and framework direct memory,could be
> calculated
> > > > > > > > > correctly,then
> > > > > > > > > I am in favor of setting direct memory to a fixed value.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Memory Calculation
> > > > > > > > >
> > > > > > > > > I agree with Xintong. For Yarn and K8s, we need to check the
> > > > memory
> > > > > > > > > configurations in client to avoid submitting successfully
> and
> > > > > failing
> > > > > > > in
> > > > > > > > > the flink master.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > >
> > > > > > > > > Yang
> > > > > > > > >
> > > > > > > > > Xintong Song <[hidden email]>于2019年8月13日 周二22:07写道:
> > > > > > > > >
> > > > > > > > > > Thanks for replying, Till.
> > > > > > > > > >
> > > > > > > > > > About MemorySegment, I think you are right that we should
> > not
> > > > > > include
> > > > > > > > > this
> > > > > > > > > > issue in the scope of this FLIP. This FLIP should
> > concentrate
> > > > on
> > > > > > how
> > > > > > > to
> > > > > > > > > > configure memory pools for TaskExecutors, with minimum
> > > > > involvement
> > > > > > on
> > > > > > > > how
> > > > > > > > > > memory consumers use it.
> > > > > > > > > >
> > > > > > > > > > About direct memory, I think alternative 3 may not have
> > the
> > > > > same
> > > > > > > over
> > > > > > > > > > reservation issue that alternative 2 does, but at the
> cost
> > of
> > > > > risk
> > > > > > of
> > > > > > > > > over
> > > > > > > > > > using memory at the container level, which is not good.
> My
> > > > point
> > > > > is
> > > > > > > > that
> > > > > > > > > > both "Task Off-Heap Memory" and "JVM Overhead" are not
> easy
> > > to
> > > > > > > config.
> > > > > > > > > For
> > > > > > > > > > alternative 2, users might configure them higher than
> what
> > > > > actually
> > > > > > > > > needed,
> > > > > > > > > > just to avoid getting a direct OOM. For alternative 3,
> > users
> > > do
> > > > > not
> > > > > > > get
> > > > > > > > > > direct OOM, so they may not config the two options
> > > aggressively
> > > > > > high.
> > > > > > > > But
> > > > > > > > > > the consequences are risks of overall container memory
> > usage
> > > > > > exceeds
> > > > > > > > the
> > > > > > > > > > budget.
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > > > [hidden email]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > > > >
> > > > > > > > > > > All in all I think it already looks quite good.
> > Concerning
> > > > the
> > > > > > > first
> > > > > > > > > open
> > > > > > > > > > > question about allocating memory segments, I was
> > wondering
> > > > > > whether
> > > > > > > > this
> > > > > > > > > > is
> > > > > > > > > > > strictly necessary to do in the context of this FLIP or
> > > > whether
> > > > > > > this
> > > > > > > > > > could
> > > > > > > > > > > be done as a follow up? Without knowing all details, I
> > > would
> > > > be
> > > > > > > > > concerned
> > > > > > > > > > > that we would widen the scope of this FLIP too much
> > because
> > > > we
> > > > > > > would
> > > > > > > > > have
> > > > > > > > > > > to touch all the existing call sites of the
> MemoryManager
> > > > where
> > > > > > we
> > > > > > > > > > allocate
> > > > > > > > > > > memory segments (this should mainly be batch
> operators).
> > > The
> > > > > > > addition
> > > > > > > > > of
> > > > > > > > > > > the memory reservation call to the MemoryManager should
> > not
> > > > be
> > > > > > > > affected
> > > > > > > > > > by
> > > > > > > > > > > this and I would hope that this is the only point of
> > > > > interaction
> > > > > > a
> > > > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > > > >
> > > > > > > > > > > Concerning the second open question about setting or
> not
> > > > > setting
> > > > > > a
> > > > > > > > max
> > > > > > > > > > > direct memory limit, I would also be interested why
> Yang
> > > Wang
> > > > > > > thinks
> > > > > > > > > > > leaving it open would be best. My concern about this
> > would
> > > be
> > > > > > that
> > > > > > > we
> > > > > > > > > > would
> > > > > > > > > > > be in a similar situation as we are now with the
> > > > > > > RocksDBStateBackend.
> > > > > > > > > If
> > > > > > > > > > > the different memory pools are not clearly separated
> and
> > > can
> > > > > > spill
> > > > > > > > over
> > > > > > > > > > to
> > > > > > > > > > > a different pool, then it is quite hard to understand
> > what
> > > > > > exactly
> > > > > > > > > > causes a
> > > > > > > > > > > process to get killed for using too much memory. This
> > could
> > > > > then
> > > > > > > > easily
> > > > > > > > > > > lead to a similar situation what we have with the
> > > > cutoff-ratio.
> > > > > > So
> > > > > > > > why
> > > > > > > > > > not
> > > > > > > > > > > setting a sane default value for max direct memory and
> > > giving
> > > > > the
> > > > > > > > user
> > > > > > > > > an
> > > > > > > > > > > option to increase it if he runs into an OOM.
> > > > > > > > > > >
> > > > > > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > > > > > utilization
> > > > > > > > than
> > > > > > > > > > > alternative 3 where we set the direct memory to a
> higher
> > > > value?
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Till
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > > > [hidden email]
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > > > >
> > > > > > > > > > > > Regarding your comments:
> > > > > > > > > > > >
> > > > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > > > I think setting a very large max direct memory size
> > > > > definitely
> > > > > > > has
> > > > > > > > > some
> > > > > > > > > > > > good sides. E.g., we do not worry about direct OOM,
> and
> > > we
> > > > > > don't
> > > > > > > > even
> > > > > > > > > > > need
> > > > > > > > > > > > to allocate managed / network memory with
> > > > Unsafe.allocate() .
> > > > > > > > > > > > However, there are also some down sides of doing
> this.
> > > > > > > > > > > >
> > > > > > > > > > > >    - One thing I can think of is that if a task
> > executor
> > > > > > > container
> > > > > > > > is
> > > > > > > > > > > >    killed due to overusing memory, it could be hard for us to know
> > > > > > > > > > which
> > > > > > > > > > > > part
> > > > > > > > > > > >    of the memory is overused.
> > > > > > > > > > > >    - Another down side is that the JVM never triggers GC
> > > due
> > > > > to
> > > > > > > > > reaching
> > > > > > > > > > > max
> > > > > > > > > > > >    direct memory limit, because the limit is too high
> > to
> > > be
> > > > > > > > reached.
> > > > > > > > > > That
> > > > > > > > > > > >    means we kind of rely on heap memory to trigger GC and release direct
> > > > > > > > > > > >    memory. That could be a problem in cases where we
> > have
> > > > > more
> > > > > > > > direct
> > > > > > > > > > > > memory
> > > > > > > > > > > >    usage but not enough heap activity to trigger the
> > GC.
> > > > > > > > > > > >
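The second bullet above can be shown in miniature: a direct buffer reserves native memory that is only returned when the buffer object itself is garbage collected, so with an effectively unlimited -XX:MaxDirectMemorySize nothing ever forces that collection. A small sketch using only the plain JDK (illustrative; no Flink code is assumed):

```java
import java.nio.ByteBuffer;

public class DirectReclaim {
    public static void main(String[] args) {
        // Each allocateDirect call reserves native memory that is returned
        // only when the owning ByteBuffer object is garbage collected.
        // If -XX:MaxDirectMemorySize is huge (alternative 3), the reservation
        // never fails, so nothing forces a GC to reclaim these buffers:
        // native usage can grow while the heap stays too quiet to trigger GC.
        long total = 0;
        for (int i = 0; i < 8; i++) {
            ByteBuffer b = ByteBuffer.allocateDirect(1 << 20); // 1 MiB each
            total += b.capacity();
            // dropping 'b' here does NOT free the native memory immediately
        }
        System.out.println(total >> 20); // MiB reserved across the loop
    }
}
```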
> > > > > > > > > > > > Maybe you can share your reasons for preferring
> > setting a
> > > > > very
> > > > > > > > large
> > > > > > > > > > > value,
> > > > > > > > > > > > if there is anything else I overlooked.
> > > > > > > > > > > >
> > > > > > > > > > > > *Memory Calculation*
> > > > > > > > > > > > If there is any conflict between multiple
> configuration
> > > > that
> > > > > > user
> > > > > > > > > > > > explicitly specified, I think we should throw an
> error.
> > > > > > > > > > > > I think doing checking on the client side is a good
> > idea,
> > > > so
> > > > > > that
> > > > > > > > on
> > > > > > > > > > > Yarn /
> > > > > > > > > > > > K8s we can discover the problem before submitting the
> > > Flink
> > > > > > > > cluster,
> > > > > > > > > > > which
> > > > > > > > > > > > is always a good thing.
> > > > > > > > > > > > But we can not only rely on the client side checking,
> > > > because
> > > > > > for
> > > > > > > > > > > > standalone cluster TaskManagers on different machines
> > may
> > > > > have
> > > > > > > > > > different
> > > > > > > > > > > > configurations and the client does not see them.
> > > > > > > > > > > > What do you think?
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you~
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song
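The client-side check discussed above could amount to something like the following sketch (hypothetical validation logic; the pool names and the method are invented for illustration and are not actual Flink configuration keys or APIs):

```java
import java.util.Map;

public class ConfigCheck {
    /**
     * Hypothetical client-side validation: reject a configuration whose
     * explicitly set fine-grained pools cannot fit into the configured
     * total process memory, so the failure surfaces before submission.
     */
    static void validate(long totalProcessMb, Map<String, Long> poolsMb) {
        long sum = poolsMb.values().stream().mapToLong(Long::longValue).sum();
        if (sum > totalProcessMb) {
            throw new IllegalArgumentException(
                "Sum of configured pools (" + sum
                + " MB) exceeds total process memory (" + totalProcessMb + " MB)");
        }
    }

    public static void main(String[] args) {
        // Fits: 100 + 300 + 500 = 900 MB within a 1000 MB budget.
        validate(1000, Map.of("network", 100L, "managed", 300L, "heap", 500L));
        try {
            // Does not fit: 1200 MB requested from a 1000 MB budget.
            validate(1000, Map.of("network", 400L, "managed", 400L, "heap", 400L));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```

Note this mirrors the point in the mail: such a check catches conflicts early on Yarn/K8s, but standalone TaskManagers still need the same check locally.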
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > > > [hidden email]>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi xintong,
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for your detailed proposal. After all the
> > memory
> > > > > > > > > configuration
> > > > > > > > > > > are
> > > > > > > > > > > > > introduced, it will be more powerful to control the
> > > flink
> > > > > > > memory
> > > > > > > > > > > usage. I
> > > > > > > > > > > > > just have few questions about it.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > > > >
> > > > > > > > > > > > > We do not differentiate user direct memory and
> native
> > > > > memory.
> > > > > > > > They
> > > > > > > > > > are
> > > > > > > > > > > > all
> > > > > > > > > > > > > included in task off-heap memory. Right? So I don't think we could set
> > > > > > > > > > > > > the -XX:MaxDirectMemorySize properly. I prefer
> > leaving
> > > > it a
> > > > > > > very
> > > > > > > > > > large
> > > > > > > > > > > > > value.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > > > >
> > > > > > > > > > > > > If the sum of fine-grained memory (network
> memory,
> > > > > managed
> > > > > > > > > memory,
> > > > > > > > > > > > etc.)
> > > > > > > > > > > > > is larger than total process memory, how do we deal
> > > with
> > > > > this
> > > > > > > > > > > situation?
> > > > > > > > > > > > Do
> > > > > > > > > > > > > we need to check the memory configuration in
> client?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong Song <[hidden email]> 于2019年8月7日周三
> > > > > 下午10:14写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We would like to start a discussion thread on
> > > "FLIP-49:
> > > > > > > Unified
> > > > > > > > > > > Memory
> > > > > > > > > > > > > > Configuration for TaskExecutors"[1], where we
> > > describe
> > > > > how
> > > > > > to
> > > > > > > > > > improve
> > > > > > > > > > > > > > TaskExecutor memory configurations. The FLIP
> > document
> > > > is
> > > > > > > mostly
> > > > > > > > > > based
> > > > > > > > > > > > on
> > > > > > > > > > > > > an
> > > > > > > > > > > > > > early design "Memory Management and Configuration
> > > > > > > Reloaded"[2]
> > > > > > > > by
> > > > > > > > > > > > > Stephan,
> > > > > > > > > > > > > > with updates from follow-up discussions both
> online
> > > and
> > > > > > > > offline.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This FLIP addresses several shortcomings of
> current
> > > > > (Flink
> > > > > > > 1.9)
> > > > > > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    - Different configuration for Streaming and
> > Batch.
> > > > > > > > > > > > > >    - Complex and difficult configuration of
> RocksDB
> > > in
> > > > > > > > Streaming.
> > > > > > > > > > > > > >    - Complicated, uncertain and hard to
> understand.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Key changes to solve the problems can be
> summarized
> > > as
> > > > > > > follows.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    - Extend memory manager to also account for
> > memory
> > > > > usage
> > > > > > > by
> > > > > > > > > > state
> > > > > > > > > > > > > >    backends.
> > > > > > > > > > > > > >    - Modify how TaskExecutor memory is partitioned into
> > > > > > > > > > > > > >    individual memory reservations and pools.
> > > > > > > > > > > > > >    - Simplify memory configuration options and
> > > > > > > > > > > > > >    calculation logic.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Please find more details in the FLIP wiki
> document
> > > [1].
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (Please note that the early design doc [2] is out
> > of
> > > > > sync,
> > > > > > > and
> > > > > > > > it
> > > > > > > > > > is
> > > > > > > > > > > > > > appreciated to have the discussion in this
> mailing
> > > list
> > > > > > > > thread.)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Looking forward to your feedbacks.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [2]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <[hidden email]>
> > > > wrote:
> > > >
> > > > > Thanks for sharing your opinion Till.
> > > > >
> > > > > I'm also in favor of alternative 2. I was wondering whether we can
> > > avoid
> > > > > using Unsafe.allocate() for off-heap managed memory and network
> > memory
> > > > with
> > > > > alternative 3. But after giving it a second thought, I think even
> for
> > > > > alternative 3 using direct memory for off-heap managed memory could
> > > cause
> > > > > problems.
> > > > >
> > > > > Hi Yang,
> > > > >
> > > > > Regarding your concern, I think what is proposed in this FLIP is to
> have
> > > > both
> > > > > off-heap managed memory and network memory allocated through
> > > > > Unsafe.allocate(), which means they are practically native memory
> and
> > > not
> > > > > limited by JVM max direct memory. The only parts of memory limited
> by
> > > JVM
> > > > > max direct memory are task off-heap memory and JVM overhead, which
> > are
> > > > > exactly alternative 2 suggests to set the JVM max direct memory to.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
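The distinction drawn above, memory obtained via Unsafe versus JVM direct buffers, can be made concrete with a minimal, self-contained sketch (illustrative only, not Flink's MemorySegment code; obtaining sun.misc.Unsafe by reflection is a common workaround, since Unsafe.getUnsafe() rejects application classes):

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class NativeVsDirect {
    public static void main(String[] args) throws Exception {
        // Obtain sun.misc.Unsafe via reflection.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        // Native allocation: NOT counted against -XX:MaxDirectMemorySize,
        // and never reclaimed by GC -- it must be freed explicitly.
        long addr = unsafe.allocateMemory(1024);
        unsafe.putLong(addr, 42L);
        long fromNative = unsafe.getLong(addr);
        unsafe.freeMemory(addr);

        // Direct buffer: counted against -XX:MaxDirectMemorySize and only
        // reclaimed once the buffer object is garbage collected.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);
        direct.putLong(0, 42L);

        System.out.println(fromNative == direct.getLong(0)); // both hold 42
    }
}
```

This is why pools allocated the first way fall outside the JVM direct memory limit, leaving only task off-heap memory and JVM overhead under it, as described above.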
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <
> [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Thanks for the clarification Xintong. I understand the two
> > > alternatives
> > > > > > now.
> > > > > >
> > > > > > I would be in favour of option 2 because it makes things
> explicit.
> > If
> > > > we
> > > > > > don't limit the direct memory, I fear that we might end up in a
> > > similar
> > > > > > situation as we are currently in: The user might see that her
> > process
> > > > > gets
> > > > > > killed by the OS and does not know why this is the case.
> > > Consequently,
> > > > > she
> > > > > > tries to increase the process memory size (similar to increasing
> > the
> > > > > cutoff
> > > > > > ratio) in order to accommodate for the extra direct memory. Even
> > > worse,
> > > > > she
> > > > > > tries to decrease memory budgets which are not fully used and
> hence
> > > > won't
> > > > > > change the overall memory consumption.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Let me explain this with a concrete example Till.
> > > > > > >
> > > > > > > Let's say we have the following scenario.
> > > > > > >
> > > > > > > Total Process Memory: 1GB
> > > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> > > Memory
> > > > > and
> > > > > > > Network Memory): 800MB
> > > > > > >
> > > > > > >
> > > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very
> large
> > > > > value,
> > > > > > > let's say 1TB.
> > > > > > >
> > > > > > > If the actual direct memory usage of Task Off-Heap Memory and
> JVM
> > > > > > Overhead
> > > > > > > do not exceed 200MB, then alternative 2 and alternative 3
> should
> > > have
> > > > > the
> > > > > > > same utility. Setting larger -XX:MaxDirectMemorySize will not
> > > reduce
> > > > > the
> > > > > > > sizes of the other memory pools.
> > > > > > >
> > > > > > > If the actual direct memory usage of Task Off-Heap Memory and
> JVM
> > > > > > > Overhead potentially exceed 200MB, then
> > > > > > >
> > > > > > >    - Alternative 2 suffers from frequent OOM. To avoid that,
> the
> > > only
> > > > > > thing
> > > > > > >    user can do is to modify the configuration and increase JVM
> > > Direct
> > > > > > > Memory
> > > > > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that user
> > > > increases
> > > > > > JVM
> > > > > > >    Direct Memory to 250MB, this will reduce the total size of
> > other
> > > > > > memory
> > > > > > >    pools to 750MB, given the total process memory remains 1GB.
> > > > > > >    - For alternative 3, there is no chance of direct OOM. There
> > are
> > > > > > chances
> > > > > > >    of exceeding the total process memory limit, but given that
> > the
> > > > > > process
> > > > > > > may
> > > > > > >    not use up all the reserved native memory (Off-Heap Managed
> > > > Memory,
> > > > > > > Network
> > > > > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > > > > slightly
> > > > > > > above
> > > > > > >    yet very close to 200MB, users probably do not need to change
> > the
> > > > > > >    configurations.
> > > > > > >
> > > > > > > Therefore, I think from the user's perspective, a feasible
> > > > > configuration
> > > > > > > for alternative 2 may lead to lower resource utilization
> compared
> > > to
> > > > > > > alternative 3.
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
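The arithmetic in the example above is just a fixed total budget: whatever the JVM direct cap claims comes out of the other pools. A tiny sketch with the numbers from the mail (illustrative names, not real Flink configuration keys):

```java
public class MemoryBudget {
    // Heap, metaspace, off-heap managed memory and network memory share
    // whatever the JVM direct budget leaves over from the process total.
    static long otherPoolsMb(long totalProcessMb, long jvmDirectMb) {
        return totalProcessMb - jvmDirectMb;
    }

    public static void main(String[] args) {
        // 1 GB (1000 MB) total process memory, as in the example above.
        System.out.println(otherPoolsMb(1000, 200)); // alternative 2 as sized
        System.out.println(otherPoolsMb(1000, 250)); // after raising the direct cap
    }
}
```

Raising the direct cap from 200 MB to 250 MB shrinks the other pools from 800 MB to 750 MB, which is the utilization cost of alternative 2 being discussed.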
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > > > > > > >    memory reservations and pools.
> > > > > > > > > > > > > >    - Simplify memory configuration options and
> > > > > calculations
> > > > > > > > > logics.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Please find more details in the FLIP wiki
> document
> > > [1].
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (Please note that the early design doc [2] is out
> > of
> > > > > sync,
> > > > > > > and
> > > > > > > > it
> > > > > > > > > > is
> > > > > > > > > > > > > > appreciated to have the discussion in this
> mailing
> > > list
> > > > > > > > thread.)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Looking forward to your feedbacks.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [2]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Xintong Song
Hi everyone,

I just updated the FLIP document on wiki [1], with the following changes.

   - Removed open question regarding MemorySegment allocation. As
   discussed, we exclude this topic from the scope of this FLIP.
   - Updated content about JVM direct memory parameter according to recent
   discussions, and moved the other options to "Rejected Alternatives" for the
   moment.
   - Added implementation steps.


Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors

On Mon, Aug 19, 2019 at 7:16 PM Stephan Ewen <[hidden email]> wrote:

> @Xintong: Concerning "wait for memory users before task dispose and memory
> release": I agree, that's how it should be. Let's try it out.
>
> @Xintong @Jingsong: Concerning " JVM does not wait for GC when allocating
> direct memory buffer": There seems to be pretty elaborate logic to free
> buffers when allocating new ones. See
>
> http://hg.openjdk.java.net/jdk8u/jdk8u-dev/jdk/file/tip/src/share/classes/java/nio/Bits.java#l643
>
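To illustrate the logic Stephan points to, here is a minimal sketch of the idea behind `java.nio.Bits#reserveMemory` — my own simplification for discussion, not the actual JDK code. The `cleaner` hook stands in for the JDK's pending-reference processing and `System.gc()` call:

```java
// Simplified model of the JDK's direct-memory reservation logic
// (loosely based on java.nio.Bits#reserveMemory in JDK 8; the names and
// structure here are illustrative, not the real implementation).
public class ReserveMemorySketch {
    private final long maxDirectMemory;
    private long reserved;
    private final Runnable cleaner; // stands in for pending Cleaners / System.gc()

    public ReserveMemorySketch(long maxDirectMemory, Runnable cleaner) {
        this.maxDirectMemory = maxDirectMemory;
        this.cleaner = cleaner;
    }

    private boolean tryReserve(long size) {
        if (reserved + size <= maxDirectMemory) {
            reserved += size;
            return true;
        }
        return false;
    }

    public void unreserve(long size) {
        reserved -= size;
    }

    /** Reserve space for a new direct buffer, freeing dead ones if needed. */
    public void reserve(long size) {
        if (tryReserve(size)) {
            return;              // fast path: limit not reached
        }
        cleaner.run();           // let GC / Cleaners release dead buffers
        if (!tryReserve(size)) { // retry once; the real JDK loops with sleeps
            throw new OutOfMemoryError("Direct buffer memory");
        }
    }

    public long reserved() {
        return reserved;
    }
}
```

The takeaway for this thread: the JVM does make an effort to reclaim direct memory before throwing an OOM, but it still depends on the referencing buffer objects being collectable in time, which is the concern Jingsong raises below.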
> @Till: Maybe. If we assume that the JVM default works (like going with
> option 2 and not setting "-XX:MaxDirectMemorySize" at all), then I think it
> should be okay to set "-XX:MaxDirectMemorySize" to
> "off_heap_managed_memory + direct_memory" even if we use RocksDB. That is a
> big if, though, I honestly have no idea :D Would be good to understand
> this, though, because this would affect option (2) and option (1.2).
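As a concrete reading of option 1.1, the arithmetic could be sketched as follows. This is only an illustration of the formula discussed above; the method names and the 64 MB default are assumptions, not the FLIP's final configuration keys:

```java
// Sketch: deriving the -XX:MaxDirectMemorySize value for option 1.1,
// where the limit covers off-heap managed memory plus a dedicated
// direct-memory pool. All names and sizes are illustrative assumptions.
public class DirectMemoryFlag {
    public static long maxDirectMemorySizeBytes(long offHeapManagedBytes, long directMemoryBytes) {
        // Option 1.1: limit = off_heap_managed_memory + direct_memory.
        // JVM overhead is intentionally NOT included, so direct buffers
        // cannot eat into the memory reserved for the JVM overhead.
        return offHeapManagedBytes + directMemoryBytes;
    }

    public static String asJvmArg(long bytes) {
        return "-XX:MaxDirectMemorySize=" + bytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Example: 300 MB off-heap managed memory + the suggested 64 MB default.
        long limit = maxDirectMemorySizeBytes(300 * mb, 64 * mb);
        System.out.println(asJvmArg(limit));
    }
}
```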
>
> On Mon, Aug 19, 2019 at 4:44 PM Xintong Song <[hidden email]>
> wrote:
>
> > Thanks for the inputs, Jingsong.
> >
> > Let me try to summarize your points. Please correct me if I'm wrong.
> >
> >    - Memory consumers should always avoid returning memory segments to
> >    memory manager while there are still un-cleaned structures / threads
> > that
> >    may use the memory. Otherwise, it would cause serious problems by
> having
> >    multiple consumers trying to use the same memory segment.
> >    - JVM does not wait for GC when allocating a direct memory buffer.
> >    Therefore, even if we set a proper max direct memory size limit, we may
> >    still encounter a direct memory OOM if the GC cleans memory slower than
> >    direct memory is allocated.
> >
> > Am I understanding this correctly?
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Mon, Aug 19, 2019 at 4:21 PM JingsongLee <[hidden email]
> > .invalid>
> > wrote:
> >
> > > Hi stephan:
> > >
> > > About option 2:
> > >
> > > If additional threads are not cleanly shut down before the task exits:
> > > with the current memory reuse, the exiting task has already freed the
> > > memory it uses. If that memory is handed to other tasks while
> > > asynchronous threads of the exited task are still writing to it, there
> > > will be concurrency problems, possibly even errors in user computation
> > > results.
> > >
> > > So I think this is a serious and intolerable bug. No matter which
> > > option we choose, it should be avoided.
> > >
> > > About direct memory cleaned by GC:
> > > I don't think it is a good idea. I've encountered many situations
> > > where GC kicked in too late, causing a DirectMemory OOM. How
> > > DirectMemory is released and allocated depends on the type of user
> > > job, which is often beyond our control.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > >
> > > ------------------------------------------------------------------
> > > From:Stephan Ewen <[hidden email]>
> > > Send Time: Mon, Aug 19, 2019, 15:56
> > > To:dev <[hidden email]>
> > > Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> > > TaskExecutors
> > >
> > > My main concern with option 2 (manually release memory) is that
> segfaults
> > > in the JVM send off all sorts of alarms on user ends. So we need to
> > > guarantee that this never happens.
> > >
> > > The trickiness is in tasks that use data structures / algorithms with
> > > additional threads, like hash table spill/read and sorting threads. We
> > > need to ensure that these cleanly shut down before we can exit the task.
> > > I am not sure that we have that guaranteed already, that's why option
> 1.1
> > > seemed simpler to me.
> > >
> > > On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for the comments, Stephan. Summarized in this way really makes
> > > > things easier to understand.
> > > >
> > > > I'm in favor of option 2, at least for the moment. I think it is not
> > > > that difficult to keep the memory manager segfault safe, as long as we
> > > > always de-allocate the memory segment when it is released by the memory
> > > > consumers. A segfault can only happen if a memory consumer continues
> > > > using the buffer of a memory segment after releasing it, in which case
> > > > we do want the job to fail so we detect the memory leak early.
> > > >
> > > > For option 1.2, I don't think this is a good idea. Not only may the
> > > > assumption (regular GC is enough to clean direct buffers) not always be
> > > > true, it also makes it harder to find problems in cases of memory
> > > > overuse. E.g., say a user configured some direct memory for the user
> > > > libraries. If a library actually uses more direct memory than
> > > > configured, which cannot be cleaned by GC because it is still in use,
> > > > this may lead to overuse of the total container memory. In that case,
> > > > if usage does not touch the JVM default max direct memory limit, we
> > > > never get a direct memory OOM and it becomes super hard to understand
> > > > which part of the configuration needs to be updated.
> > > >
> > > > Option 1.1 has a similar problem to 1.2, if the exceeded direct
> > > > memory does not reach the max direct memory limit specified by the
> > > > dedicated parameter. I think it is slightly better than 1.2, only
> > > > because we can tune the parameter.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <[hidden email]>
> wrote:
> > > >
> > > > > About the "-XX:MaxDirectMemorySize" discussion, maybe let me
> > summarize
> > > > it a
> > > > > bit differently:
> > > > >
> > > > > We have the following two options:
> > > > >
> > > > > (1) We let MemorySegments be de-allocated by the GC. That makes it
> > > > segfault
> > > > > safe. But then we need a way to trigger GC in case de-allocation
> and
> > > > > re-allocation of a bunch of segments happens quickly, which is
> often
> > > the
> > > > > case during batch scheduling or task restart.
> > > > >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do
> this
> > > > >   - Another way could be to have a dedicated bookkeeping in the
> > > > > MemoryManager (option 1.2), so that this is a number independent of
> > the
> > > > > "-XX:MaxDirectMemorySize" parameter.
> > > > >
> > > > > (2) We manually allocate and de-allocate the memory for the
> > > > MemorySegments
> > > > > (option 2). That way we need not worry about triggering GC by some
> > > > > threshold or bookkeeping, but it is harder to prevent segfaults. We
> > > need
> > > > to
> > > > > be very careful about when we release the memory segments (only in
> > the
> > > > > cleanup phase of the main thread).
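For reference, option 2 roughly corresponds to allocating segment memory outside the GC-managed direct-buffer pool. Below is a minimal sketch of that idea using `sun.misc.Unsafe` obtained via reflection — my illustration only, not Flink's actual MemorySegment code; the class and method names are hypothetical:

```java
import java.lang.reflect.Field;

// Sketch of option 2: memory allocated with Unsafe is invisible to
// -XX:MaxDirectMemorySize and must be freed manually -- which is exactly
// why a use-after-free through a raw address segfaults instead of
// throwing. Illustrative only, not Flink's MemorySegment implementation.
public class ManualSegment {
    private static final sun.misc.Unsafe UNSAFE = unsafe();
    private long address;
    private final long size;

    private static sun.misc.Unsafe unsafe() {
        try {
            Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (sun.misc.Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Unsafe unavailable", e);
        }
    }

    public ManualSegment(long size) {
        this.size = size;
        this.address = UNSAFE.allocateMemory(size); // native, not direct-memory accounted
    }

    public void putLong(long offset, long value) {
        checkAccess(offset);
        UNSAFE.putLong(address + offset, value);
    }

    public long getLong(long offset) {
        checkAccess(offset);
        return UNSAFE.getLong(address + offset);
    }

    /** Must only be called once all users are done -- there is no GC safety net. */
    public void free() {
        checkAccess(0);
        UNSAFE.freeMemory(address);
        address = 0;
    }

    private void checkAccess(long offset) {
        if (address == 0) {
            throw new IllegalStateException("segment already freed");
        }
        if (offset < 0 || offset + 8 > size) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
    }
}
```

Note that the `checkAccess` guard only protects against use-after-free through this wrapper; a consumer that cached the raw address would still segfault, which is the risk Stephan describes.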
> > > > >
> > > > > If we go with option 1.1, we probably need to set
> > > > > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> > direct_memory"
> > > > and
> > > > > have "direct_memory" as a separate reserved memory pool. Because if
> > we
> > > > just
> > > > > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> > > > jvm_overhead",
> > > > > then there will be times when that entire memory is allocated by
> > direct
> > > > > buffers and we have nothing left for the JVM overhead. So we either
> > > need
> > > > a
> > > > > way to compensate for that (again some safety margin cutoff value)
> or
> > > we
> > > > > will exceed container memory.
> > > > >
> > > > > If we go with option 1.2, we need to be aware that it takes
> elaborate
> > > > logic
> > > > > to push recycling of direct buffers without always triggering a
> full
> > > GC.
> > > > >
> > > > >
> > > > > My first guess is that the options will be easiest to do in the
> > > following
> > > > > order:
> > > > >
> > > > >   - Option 1.1 with a dedicated direct_memory parameter, as
> discussed
> > > > > above. We would need to find a way to set the direct_memory
> parameter
> > > by
> > > > > default. We could start with 64 MB and see how it goes in practice.
> > One
> > > > > danger I see is that setting this too low can cause a bunch of
> > > > > additional GCs compared to before (we need to watch this carefully).
> > > > >
> > > > >   - Option 2. It is actually quite simple to implement, we could
> try
> > > how
> > > > > segfault safe we are at the moment.
> > > > >
> > > > >   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> > > > parameter
> > > > > at all and assume that all the direct memory allocations that the
> JVM
> > > and
> > > > > Netty do are infrequent enough to be cleaned up fast enough through
> > > > regular
> > > > > GC. I am not sure if that is a valid assumption, though.
> > > > >
> > > > > Best,
> > > > > Stephan
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <
> [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Thanks for sharing your opinion Till.
> > > > > >
> > > > > > I'm also in favor of alternative 2. I was wondering whether we
> can
> > > > avoid
> > > > > > using Unsafe.allocate() for off-heap managed memory and network
> > > memory
> > > > > with
> > > > > > alternative 3. But after giving it a second thought, I think even
> > for
> > > > > > alternative 3 using direct memory for off-heap managed memory
> could
> > > > cause
> > > > > > problems.
> > > > > >
> > > > > > Hi Yang,
> > > > > >
> > > > > > Regarding your concern, what is proposed in this FLIP is to have
> > > > > > both off-heap managed memory and network memory allocated through
> > > > > > Unsafe.allocate(), which means they are practically native memory
> > > > > > and not limited by the JVM max direct memory. The only parts of
> > > > > > memory limited by the JVM max direct memory are task off-heap
> > > > > > memory and JVM overhead, which are exactly what alternative 2
> > > > > > suggests to set the JVM max direct memory to.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <
> > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the clarification Xintong. I understand the two
> > > > alternatives
> > > > > > > now.
> > > > > > >
> > > > > > > I would be in favour of option 2 because it makes things
> > explicit.
> > > If
> > > > > we
> > > > > > > don't limit the direct memory, I fear that we might end up in a
> > > > similar
> > > > > > > situation as we are currently in: The user might see that her
> > > > > > > process gets killed by the OS and does not know why this is the
> > > > > > > case. Consequently, she tries to decrease the process memory size
> > > > > > > (similar to increasing the cutoff ratio) in order to accommodate
> > > > > > > the extra direct memory. Even worse, she might try to decrease
> > > > > > > memory budgets which are not fully used and hence won't change
> > > > > > > the overall memory consumption.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> > > [hidden email]
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Let me explain this with a concrete example Till.
> > > > > > > >
> > > > > > > > Let's say we have the following scenario.
> > > > > > > >
> > > > > > > > Total Process Memory: 1GB
> > > > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead):
> 200MB
> > > > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap
> Managed
> > > > Memory
> > > > > > and
> > > > > > > > Network Memory): 800MB
> > > > > > > >
> > > > > > > >
> > > > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very
> > large
> > > > > > value,
> > > > > > > > let's say 1TB.
> > > > > > > >
> > > > > > > > If the actual direct memory usage of Task Off-Heap Memory and
> > JVM
> > > > > > > Overhead
> > > > > > > > do not exceed 200MB, then alternative 2 and alternative 3
> > should
> > > > have
> > > > > > the
> > > > > > > > same utility. Setting larger -XX:MaxDirectMemorySize will not
> > > > reduce
> > > > > > the
> > > > > > > > sizes of the other memory pools.
> > > > > > > >
> > > > > > > > If the actual direct memory usage of Task Off-Heap Memory and
> > JVM
> > > > > > > > Overhead potentially exceed 200MB, then
> > > > > > > >
> > > > > > > >    - Alternative 2 suffers from frequent OOMs. To avoid that,
> > > > > > > >    the only thing the user can do is to modify the configuration
> > > > > > > >    and increase JVM Direct Memory (Task Off-Heap Memory + JVM
> > > > > > > >    Overhead). Say the user increases JVM Direct Memory to 250MB;
> > > > > > > >    this reduces the total size of the other memory pools to
> > > > > > > >    750MB, given that the total process memory remains 1GB.
> > > > > > > >    - For alternative 3, there is no chance of a direct OOM.
> > > > > > > >    There are chances of exceeding the total process memory
> > > > > > > >    limit, but given that the process may not use up all the
> > > > > > > >    reserved native memory (Off-Heap Managed Memory, Network
> > > > > > > >    Memory, JVM Metaspace), if the actual direct memory usage is
> > > > > > > >    slightly above yet very close to 200MB, the user probably
> > > > > > > >    does not need to change the configurations.
> > > > > > > >
> > > > > > > > Therefore, I think from the user's perspective, a feasible
> > > > > > configuration
> > > > > > > > for alternative 2 may lead to lower resource utilization
> > compared
> > > > to
> > > > > > > > alternative 3.
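The arithmetic of the example above can be written out as a small sketch. The numbers are the round ones from this mail (treating 1 GB as 1000 MB, as the round 800/750 MB figures suggest); the pool names are informal:

```java
// Sketch of the 1 GB example: the total process memory is fixed, so
// growing the JVM direct memory budget shrinks the other pools
// (JVM heap, metaspace, off-heap managed memory, network memory).
public class MemoryBreakdown {
    static final long MB = 1024L * 1024L;

    /** Remaining budget for all pools outside the JVM direct memory limit. */
    public static long otherMemory(long totalProcessMemory, long jvmDirectMemory) {
        if (jvmDirectMemory > totalProcessMemory) {
            throw new IllegalArgumentException("direct memory exceeds total process memory");
        }
        return totalProcessMemory - jvmDirectMemory;
    }

    public static void main(String[] args) {
        long total = 1000 * MB;
        // Initial setup from the example: 200 MB direct leaves 800 MB.
        System.out.println(otherMemory(total, 200 * MB) / MB);
        // After the user raises direct memory to 250 MB: 750 MB remain.
        System.out.println(otherMemory(total, 250 * MB) / MB);
    }
}
```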
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I guess you have to help me understand the difference
> between
> > > > > > > > alternative 2
> > > > > > > > > and 3 wrt to memory under utilization Xintong.
> > > > > > > > >
> > > > > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task
> Off-Heap
> > > > Memory
> > > > > > and
> > > > > > > > JVM
> > > > > > > > > Overhead. Then there is the risk that this size is too low
> > > > > resulting
> > > > > > > in a
> > > > > > > > > lot of garbage collection and potentially an OOM.
> > > > > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something
> > larger
> > > > > than
> > > > > > > > > alternative 2. This would of course reduce the sizes of the
> > > other
> > > > > > > memory
> > > > > > > > > types.
> > > > > > > > >
> > > > > > > > > How would alternative 2 now result in an under utilization
> of
> > > > > memory
> > > > > > > > > compared to alternative 3? If alternative 3 strictly sets a
> > > > higher
> > > > > > max
> > > > > > > > > direct memory size and we use only little, then I would
> > expect
> > > > that
> > > > > > > > > alternative 3 results in memory under utilization.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <
> > > [hidden email]
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Xintong, Till,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Native and Direct Memory
> > > > > > > > > >
> > > > > > > > > > My point is to set a very large max direct memory size when
> > > > > > > > > > we do not differentiate direct and native memory. If the
> > > > > > > > > > direct memory, including user direct memory and framework
> > > > > > > > > > direct memory, could be calculated correctly, then I am in
> > > > > > > > > > favor of setting direct memory to a fixed value.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Memory Calculation
> > > > > > > > > >
> > > > > > > > > > I agree with Xintong. For Yarn and K8s, we need to check the
> > > > > > > > > > memory configurations on the client side, to avoid the job
> > > > > > > > > > submitting successfully but then failing in the Flink master.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > >
> > > > > > > > > > Yang
> > > > > > > > > >
> > > > > > > > > > Xintong Song <[hidden email]> wrote on Tue, Aug 13, 2019 at 22:07:
> > > > > > > > > >
> > > > > > > > > > > Thanks for replying, Till.
> > > > > > > > > > >
> > > > > > > > > > > About MemorySegment, I think you are right that we
> should
> > > not
> > > > > > > include
> > > > > > > > > > this
> > > > > > > > > > > issue in the scope of this FLIP. This FLIP should
> > > concentrate
> > > > > on
> > > > > > > how
> > > > > > > > to
> > > > > > > > > > > configure memory pools for TaskExecutors, with minimum
> > > > > > involvement
> > > > > > > on
> > > > > > > > > how
> > > > > > > > > > > memory consumers use it.
> > > > > > > > > > >
> > > > > > > > > > > About direct memory, I think alternative 3 may not have
> > > > > > > > > > > the same over-reservation issue that alternative 2 does,
> > > > > > > > > > > but at the cost of risking overuse of memory at the
> > > > > > > > > > > container level, which is not good. My point is that both
> > > > > > > > > > > "Task Off-Heap Memory" and "JVM Overhead" are not easy to
> > > > > > > > > > > configure. For alternative 2, users might configure them
> > > > > > > > > > > higher than what is actually needed, just to avoid getting
> > > > > > > > > > > a direct OOM. For alternative 3, users do not get a direct
> > > > > > > > > > > OOM, so they may not configure the two options
> > > > > > > > > > > aggressively high. But the consequence is the risk that
> > > > > > > > > > > overall container memory usage exceeds the budget.
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > > > > [hidden email]>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > > > > >
> > > > > > > > > > > > All in all I think it already looks quite good.
> > > Concerning
> > > > > the
> > > > > > > > first
> > > > > > > > > > open
> > > > > > > > > > > > question about allocating memory segments, I was
> > > wondering
> > > > > > > whether
> > > > > > > > > this
> > > > > > > > > > > is
> > > > > > > > > > > > strictly necessary to do in the context of this FLIP
> or
> > > > > whether
> > > > > > > > this
> > > > > > > > > > > could
> > > > > > > > > > > > be done as a follow up? Without knowing all details,
> I
> > > > would
> > > > > be
> > > > > > > > > > concerned
> > > > > > > > > > > > that we would widen the scope of this FLIP too much
> > > because
> > > > > we
> > > > > > > > would
> > > > > > > > > > have
> > > > > > > > > > > > to touch all the existing call sites of the
> > MemoryManager
> > > > > where
> > > > > > > we
> > > > > > > > > > > allocate
> > > > > > > > > > > > memory segments (this should mainly be batch
> > operators).
> > > > The
> > > > > > > > addition
> > > > > > > > > > of
> > > > > > > > > > > > the memory reservation call to the MemoryManager
> should
> > > not
> > > > > be
> > > > > > > > > affected
> > > > > > > > > > > by
> > > > > > > > > > > > this and I would hope that this is the only point of
> > > > > > interaction
> > > > > > > a
> > > > > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > > > > >
> > > > > > > > > > > > Concerning the second open question about setting or
> > not
> > > > > > setting
> > > > > > > a
> > > > > > > > > max
> > > > > > > > > > > > direct memory limit, I would also be interested why
> > Yang
> > > > Wang
> > > > > > > > thinks
> > > > > > > > > > > > leaving it open would be best. My concern about this
> > > would
> > > > be
> > > > > > > that
> > > > > > > > we
> > > > > > > > > > > would
> > > > > > > > > > > > be in a similar situation as we are now with the
> > > > > > > > RocksDBStateBackend.
> > > > > > > > > > If
> > > > > > > > > > > > the different memory pools are not clearly separated
> > and
> > > > can
> > > > > > > spill
> > > > > > > > > over
> > > > > > > > > > > to
> > > > > > > > > > > > a different pool, then it is quite hard to understand
> > > what
> > > > > > > exactly
> > > > > > > > > > > causes a
> > > > > > > > > > > > process to get killed for using too much memory. This
> > > could
> > > > > > then
> > > > > > > > > easily
> > > > > > > > > > > > lead to a similar situation to what we have with the
> > > > > > > > > > > > cutoff-ratio. So why not set a sane default value for
> > > > > > > > > > > > max direct memory and give the user an option to
> > > > > > > > > > > > increase it if he runs into an OOM.
> > > > > > > > > > > >
> > > > > > > > > > > > @Xintong, how would alternative 2 lead to lower
> memory
> > > > > > > utilization
> > > > > > > > > than
> > > > > > > > > > > > alternative 3 where we set the direct memory to a
> > higher
> > > > > value?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Till
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > > > > [hidden email]
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regarding your comments:
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > > > > I think setting a very large max direct memory size
> > > > > > definitely
> > > > > > > > has
> > > > > > > > > > some
> > > > > > > > > > > > > good sides. E.g., we do not worry about direct OOM,
> > and
> > > > we
> > > > > > > don't
> > > > > > > > > even
> > > > > > > > > > > > need
> > > > > > > > > > > > > to allocate managed / network memory with
> > > > > Unsafe.allocate() .
> > > > > > > > > > > > > However, there are also some down sides of doing
> > this.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - One thing I can think of is that if a task
> > > > > > > > > > > > >    executor container is killed due to overusing
> > > > > > > > > > > > >    memory, it could be hard for us to know which part
> > > > > > > > > > > > >    of the memory is overused.
> > > > > > > > > > > > >    - Another downside is that the JVM never triggers
> > > > > > > > > > > > >    GC due to reaching the max direct memory limit,
> > > > > > > > > > > > >    because the limit is too high to be reached. That
> > > > > > > > > > > > >    means we kind of rely on heap memory to trigger GC
> > > > > > > > > > > > >    and release direct memory. That could be a problem
> > > > > > > > > > > > >    in cases where we have more direct memory usage but
> > > > > > > > > > > > >    not enough heap activity to trigger the GC.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Maybe you can share your reasons for preferring to set
> > > > > > > > > > > > > a very large value, in case there is anything I
> > > > > > > > > > > > > overlooked.
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Memory Calculation*
> > > > > > > > > > > > > If there is any conflict between multiple
> > > > > > > > > > > > > configurations that the user explicitly specified, I
> > > > > > > > > > > > > think we should throw an error.
> > > > > > > > > > > > > I think doing checking on the client side is a good
> > > idea,
> > > > > so
> > > > > > > that
> > > > > > > > > on
> > > > > > > > > > > > Yarn /
> > > > > > > > > > > > > K8s we can discover the problem before submitting
> the
> > > > Flink
> > > > > > > > > cluster,
> > > > > > > > > > > > which
> > > > > > > > > > > > > is always a good thing.
> > > > > > > > > > > > > But we cannot rely only on the client-side checking,
> > > > > > > > > > > > > because for standalone clusters, TaskManagers on
> > > > > > > > > > > > > different machines may have different configurations
> > > > > > > > > > > > > and the client does not see them.
> > > > > > > > > > > > > What do you think?
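The check discussed here could look roughly like the following. This is a hypothetical helper for illustration; the pool names summed and the validation signature are my assumptions, not the FLIP's final design:

```java
// Sketch of a memory-configuration sanity check: the fine-grained pools
// must fit into the configured total process memory. Pool names are
// illustrative, not actual Flink configuration options.
public class MemoryConfigCheck {
    public static void validate(long totalProcessMemory,
                                long networkMemory,
                                long managedMemory,
                                long taskHeapMemory,
                                long taskOffHeapMemory,
                                long jvmMetaspace,
                                long jvmOverhead) {
        long sum = networkMemory + managedMemory + taskHeapMemory
                + taskOffHeapMemory + jvmMetaspace + jvmOverhead;
        if (sum > totalProcessMemory) {
            throw new IllegalArgumentException(
                    "Sum of configured memory pools (" + sum
                            + " bytes) exceeds total process memory ("
                            + totalProcessMemory + " bytes)");
        }
    }
}
```

Per the point above about standalone clusters, the same check would have to run again on each TaskManager at startup, since the client cannot see per-machine configurations.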
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <[hidden email]> wrote:
> > > > >
> > > > >
> > > > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <
> [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Thanks for sharing your opinion Till.
> > > > > >
> > > > > > I'm also in favor of alternative 2. I was wondering whether we
> can
> > > > avoid
> > > > > > using Unsafe.allocate() for off-heap managed memory and network
> > > memory
> > > > > with
> > > > > > alternative 3. But after giving it a second thought, I think even
> > for
> > > > > > alternative 3 using direct memory for off-heap managed memory
> could
> > > > cause
> > > > > > problems.
> > > > > >
> > > > > > Hi Yang,
> > > > > >
> > > > > > Regarding your concern, what is proposed in this FLIP is to have
> > > > > > both off-heap managed memory and network memory allocated through
> > > > > > Unsafe.allocate(), which means they are practically native memory
> > > > > > and not limited by the JVM max direct memory. The only parts of
> > > > > > memory limited by the JVM max direct memory are task off-heap
> > > > > > memory and JVM overhead, which is exactly what alternative 2
> > > > > > suggests setting -XX:MaxDirectMemorySize to.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <
> > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the clarification Xintong. I understand the two
> > > > alternatives
> > > > > > > now.
> > > > > > >
> > > > > > > I would be in favour of option 2 because it makes things
> > > > > > > explicit. If we don't limit the direct memory, I fear that we
> > > > > > > might end up in a similar situation as we are currently in:
> > > > > > > the user might see that her process gets killed by the OS and
> > > > > > > not know why. Consequently, she tries to decrease the process
> > > > > > > memory size (similar to increasing the cutoff ratio) in order
> > > > > > > to accommodate the extra direct memory. Even worse, she tries
> > > > > > > to decrease memory budgets which are not fully used and hence
> > > > > > > won't change the overall memory consumption.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> > > [hidden email]
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Let me explain this with a concrete example Till.
> > > > > > > >
> > > > > > > > Let's say we have the following scenario.
> > > > > > > >
> > > > > > > > Total Process Memory: 1GB
> > > > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed
> > > > > > > > Memory and Network Memory): 800MB
> > > > > > > >
> > > > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very
> > > > > > > > large value, let's say 1TB.
> > > > > > > >
> > > > > > > > If the actual direct memory usage of Task Off-Heap Memory and
> > > > > > > > JVM Overhead does not exceed 200MB, then alternatives 2 and 3
> > > > > > > > should have the same utility. Setting a larger
> > > > > > > > -XX:MaxDirectMemorySize will not reduce the sizes of the
> > > > > > > > other memory pools.
> > > > > > > >
> > > > > > > > If the actual direct memory usage of Task Off-Heap Memory and
> > > > > > > > JVM Overhead can potentially exceed 200MB, then
> > > > > > > >
> > > > > > > >    - Alternative 2 suffers from frequent OOMs. To avoid that,
> > > > > > > >    the only thing the user can do is to modify the
> > > > > > > >    configuration and increase JVM Direct Memory (Task Off-Heap
> > > > > > > >    Memory + JVM Overhead). Say the user increases JVM Direct
> > > > > > > >    Memory to 250MB; this reduces the total size of the other
> > > > > > > >    memory pools to 750MB, given that the total process memory
> > > > > > > >    remains 1GB.
> > > > > > > >    - For alternative 3, there is no chance of a direct OOM.
> > > > > > > >    There are chances of exceeding the total process memory
> > > > > > > >    limit, but given that the process may not use up all the
> > > > > > > >    reserved native memory (Off-Heap Managed Memory, Network
> > > > > > > >    Memory, JVM Metaspace), if the actual direct memory usage
> > > > > > > >    is slightly above yet very close to 200MB, the user
> > > > > > > >    probably does not need to change the configuration.
> > > > > > > >
> > > > > > > > Therefore, I think from the user's perspective, a feasible
> > > > > > configuration
> > > > > > > > for alternative 2 may lead to lower resource utilization
> > compared
> > > > to
> > > > > > > > alternative 3.
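To make the arithmetic in the example above concrete, here is a small self-contained sketch (not Flink code; the pool names and 1GB/200MB figures simply follow the example in this message) that derives the flag values for both alternatives:

```java
// Hypothetical helper illustrating the example above (all sizes in MB).
// "directBudgetMb" stands for Task Off-Heap Memory + JVM Overhead.
public class MemoryBreakdown {

    // Everything outside the direct budget: heap + metaspace + managed + network.
    public static int otherMemoryMb(int totalMb, int directBudgetMb) {
        return totalMb - directBudgetMb;
    }

    // Alternative 2: cap direct memory at exactly the direct budget.
    // Alternative 3: use a practically unlimited cap (here 1 TB).
    public static int maxDirectMemoryMb(int directBudgetMb, boolean capStrictly) {
        return capStrictly ? directBudgetMb : 1024 * 1024;
    }

    public static void main(String[] args) {
        int total = 1024, directBudget = 200;
        System.out.println("other pools: " + otherMemoryMb(total, directBudget) + " MB");
        System.out.println("alt 2: -XX:MaxDirectMemorySize=" + maxDirectMemoryMb(directBudget, true) + "m");
        System.out.println("alt 3: -XX:MaxDirectMemorySize=" + maxDirectMemoryMb(directBudget, false) + "m");
    }
}
```

Note that in both alternatives the other pools stay at 800MB; the alternatives only differ in where the direct-memory cap sits.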
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I guess you have to help me understand the difference
> > > > > > > > > between alternatives 2 and 3 w.r.t. memory underutilization,
> > > > > > > > > Xintong.
> > > > > > > > >
> > > > > > > > > - Alternative 2: set -XX:MaxDirectMemorySize to Task
> > > > > > > > > Off-Heap Memory and JVM Overhead. Then there is the risk
> > > > > > > > > that this size is too low, resulting in a lot of garbage
> > > > > > > > > collection and potentially an OOM.
> > > > > > > > > - Alternative 3: set -XX:MaxDirectMemorySize to something
> > > > > > > > > larger than alternative 2. This would of course reduce the
> > > > > > > > > sizes of the other memory types.
> > > > > > > > >
> > > > > > > > > How would alternative 2 now result in an underutilization
> > > > > > > > > of memory compared to alternative 3? If alternative 3
> > > > > > > > > strictly sets a higher max direct memory size and we use
> > > > > > > > > only little, then I would expect that alternative 3 results
> > > > > > > > > in memory underutilization.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <
> > > [hidden email]
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Xintong, Till,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Native and Direct Memory
> > > > > > > > > >
> > > > > > > > > > My point is about setting a very large max direct memory
> > > > > > > > > > size when we do not differentiate direct and native
> > > > > > > > > > memory. If the direct memory, including user direct
> > > > > > > > > > memory and framework direct memory, could be calculated
> > > > > > > > > > correctly, then I am in favor of setting the direct
> > > > > > > > > > memory to a fixed value.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Memory Calculation
> > > > > > > > > >
> > > > > > > > > > I agree with Xintong. For Yarn and K8s, we need to check
> > > > > > > > > > the memory configurations on the client side, to avoid a
> > > > > > > > > > submission succeeding and then failing in the Flink
> > > > > > > > > > master.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > >
> > > > > > > > > > Yang
> > > > > > > > > >
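The client-side check Yang and Xintong discuss could look roughly like the following sketch (illustrative only; the class and pool names are made up, not the FLIP-49 API): fail fast at submission time if the explicitly configured fine-grained pools exceed the total process memory.

```java
import java.util.Map;

// Hypothetical client-side validation: the sum of explicitly configured
// fine-grained pools (network, managed, etc.) must fit into the total
// process memory, otherwise submission fails with a clear error.
public class MemoryConfigCheck {

    public static void checkFineGrainedFitsTotal(long totalProcessBytes, Map<String, Long> poolBytes) {
        long sum = poolBytes.values().stream().mapToLong(Long::longValue).sum();
        if (sum > totalProcessBytes) {
            throw new IllegalArgumentException(
                "Configured memory pools (" + sum + " bytes) exceed total process memory ("
                    + totalProcessBytes + " bytes): " + poolBytes);
        }
    }
}
```

As noted later in the thread, such a check cannot live only on the client for standalone clusters, so the same validation would also have to run on the TaskManager side.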
> > > > > > > > > > Xintong Song <[hidden email]> wrote on Tue, Aug 13, 2019, 22:07:
> > > > > > > > > >
> > > > > > > > > > > Thanks for replying, Till.
> > > > > > > > > > >
> > > > > > > > > > > About MemorySegment, I think you are right that we
> should
> > > not
> > > > > > > include
> > > > > > > > > > this
> > > > > > > > > > > issue in the scope of this FLIP. This FLIP should
> > > concentrate
> > > > > on
> > > > > > > how
> > > > > > > > to
> > > > > > > > > > > configure memory pools for TaskExecutors, with minimum
> > > > > > involvement
> > > > > > > on
> > > > > > > > > how
> > > > > > > > > > > memory consumers use it.
> > > > > > > > > > >
> > > > > > > > > > > About direct memory, I think alternative 3 may not have
> > > > > > > > > > > the same over-reservation issue that alternative 2 does,
> > > > > > > > > > > but at the cost of risking memory overuse at the
> > > > > > > > > > > container level, which is not good. My point is that
> > > > > > > > > > > both "Task Off-Heap Memory" and "JVM Overhead" are not
> > > > > > > > > > > easy to configure. For alternative 2, users might
> > > > > > > > > > > configure them higher than what is actually needed,
> > > > > > > > > > > just to avoid getting a direct OOM. For alternative 3,
> > > > > > > > > > > users do not get a direct OOM, so they may not
> > > > > > > > > > > configure the two options aggressively high. But the
> > > > > > > > > > > consequence is the risk of the overall container memory
> > > > > > > > > > > usage exceeding the budget.
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > > > > [hidden email]>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > > > > >
> > > > > > > > > > > > All in all I think it already looks quite good.
> > > Concerning
> > > > > the
> > > > > > > > first
> > > > > > > > > > open
> > > > > > > > > > > > question about allocating memory segments, I was
> > > wondering
> > > > > > > whether
> > > > > > > > > this
> > > > > > > > > > > is
> > > > > > > > > > > > strictly necessary to do in the context of this FLIP
> or
> > > > > whether
> > > > > > > > this
> > > > > > > > > > > could
> > > > > > > > > > > > be done as a follow up? Without knowing all details,
> I
> > > > would
> > > > > be
> > > > > > > > > > concerned
> > > > > > > > > > > > that we would widen the scope of this FLIP too much
> > > because
> > > > > we
> > > > > > > > would
> > > > > > > > > > have
> > > > > > > > > > > > to touch all the existing call sites of the
> > MemoryManager
> > > > > where
> > > > > > > we
> > > > > > > > > > > allocate
> > > > > > > > > > > > memory segments (this should mainly be batch
> > operators).
> > > > The
> > > > > > > > addition
> > > > > > > > > > of
> > > > > > > > > > > > the memory reservation call to the MemoryManager
> should
> > > not
> > > > > be
> > > > > > > > > affected
> > > > > > > > > > > by
> > > > > > > > > > > > this and I would hope that this is the only point of
> > > > > > interaction
> > > > > > > a
> > > > > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > > > > >
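The "memory reservation call" Till mentions above could be sketched as follows (hypothetical signatures; the FLIP discussion does not fix the exact API here): a streaming consumer such as a state backend reserves and releases byte amounts instead of taking memory segments.

```java
// A minimal sketch of a reservation-style MemoryManager interaction,
// assuming a single shared budget per TaskExecutor. Not Flink's actual API.
public class SimpleMemoryManager {

    private final long totalBytes;
    private long reservedBytes;

    public SimpleMemoryManager(long totalBytes) {
        this.totalBytes = totalBytes;
    }

    // Reserve bytes for a consumer (e.g. a state backend); fail if the
    // budget is exhausted instead of handing out segments.
    public synchronized void reserveMemory(long bytes) {
        if (reservedBytes + bytes > totalBytes) {
            throw new IllegalStateException("Could not reserve " + bytes + " bytes");
        }
        reservedBytes += bytes;
    }

    public synchronized void releaseMemory(long bytes) {
        reservedBytes = Math.max(0, reservedBytes - bytes);
    }

    public synchronized long availableBytes() {
        return totalBytes - reservedBytes;
    }
}
```

The point of keeping this as the only streaming-side interaction is that the existing segment-allocation call sites (mainly batch operators) would not need to change in this FLIP.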
> > > > > > > > > > > > Concerning the second open question about setting or
> > not
> > > > > > setting
> > > > > > > a
> > > > > > > > > max
> > > > > > > > > > > > direct memory limit, I would also be interested why
> > Yang
> > > > Wang
> > > > > > > > thinks
> > > > > > > > > > > > leaving it open would be best. My concern about this
> > > would
> > > > be
> > > > > > > that
> > > > > > > > we
> > > > > > > > > > > would
> > > > > > > > > > > > be in a similar situation as we are now with the
> > > > > > > > > > > > RocksDBStateBackend. If the different memory pools
> > > > > > > > > > > > are not clearly separated and can spill over into a
> > > > > > > > > > > > different pool, then it is quite hard to understand
> > > > > > > > > > > > what exactly causes a process to get killed for using
> > > > > > > > > > > > too much memory. This could then easily lead to a
> > > > > > > > > > > > situation similar to what we have with the cutoff
> > > > > > > > > > > > ratio. So why not set a sane default value for max
> > > > > > > > > > > > direct memory and give the user an option to increase
> > > > > > > > > > > > it if she runs into an OOM?
> > > > > > > > > > > >
> > > > > > > > > > > > @Xintong, how would alternative 2 lead to lower
> memory
> > > > > > > utilization
> > > > > > > > > than
> > > > > > > > > > > > alternative 3 where we set the direct memory to a
> > higher
> > > > > value?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Till
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > > > > [hidden email]
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regarding your comments:
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > > > > I think setting a very large max direct memory size
> > > > > > > > > > > > > definitely has some good sides. E.g., we do not
> > > > > > > > > > > > > worry about direct OOM, and we don't even need to
> > > > > > > > > > > > > allocate managed / network memory with
> > > > > > > > > > > > > Unsafe.allocate(). However, there are also some
> > > > > > > > > > > > > downsides to doing this.
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - One thing I can think of is that if a task
> > > > > > > > > > > > >    executor container is killed due to overusing
> > > > > > > > > > > > >    memory, it could be hard for us to know which
> > > > > > > > > > > > >    part of the memory is overused.
> > > > > > > > > > > > >    - Another downside is that the JVM never
> > > > > > > > > > > > >    triggers GC due to reaching the max direct
> > > > > > > > > > > > >    memory limit, because the limit is too high to
> > > > > > > > > > > > >    be reached. That means we kind of rely on heap
> > > > > > > > > > > > >    memory to trigger GC and release direct memory.
> > > > > > > > > > > > >    That could be a problem in cases where we have
> > > > > > > > > > > > >    more direct memory usage but not enough heap
> > > > > > > > > > > > >    activity to trigger the GC.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Maybe you can share your reasons for preferring to
> > > > > > > > > > > > > set a very large value, in case there is anything
> > > > > > > > > > > > > else I overlooked.
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Memory Calculation*
> > > > > > > > > > > > > If there is any conflict between multiple
> > > > > > > > > > > > > configuration values that the user explicitly
> > > > > > > > > > > > > specified, I think we should throw an error.
> > > > > > > > > > > > > I think checking on the client side is a good idea,
> > > > > > > > > > > > > so that on Yarn / K8s we can discover the problem
> > > > > > > > > > > > > before submitting the Flink cluster, which is
> > > > > > > > > > > > > always a good thing.
> > > > > > > > > > > > > But we cannot rely on client-side checking alone,
> > > > > > > > > > > > > because for standalone clusters, TaskManagers on
> > > > > > > > > > > > > different machines may have different
> > > > > > > > > > > > > configurations and the client does not see them.
> > > > > > > > > > > > > What do you think?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > >

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Till Rohrmann
Hi Xintong,

thanks for addressing the comments and adding a more detailed
implementation plan. I have a couple of comments concerning the
implementation plan:

- The name `TaskExecutorSpecifics` is not really descriptive. Choosing a
different name could help here.
- I'm not sure whether I would pass the memory configuration to the
TaskExecutor via environment variables. I think it would be better to write
it into the configuration one uses to start the TM process.
- If possible, I would exclude the memory reservation from this FLIP and
add this as part of a dedicated FLIP.
- If possible, then I would exclude changes to the network stack from this
FLIP. Maybe we can simply say that the direct memory needed by the network
stack is the framework direct memory requirement. Changing how the memory
is allocated can happen in a second step. This would keep the scope of this
FLIP smaller.

Cheers,
Till
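Till's second point above — writing the derived memory sizes into the configuration used to start the TM process rather than passing them via environment variables — could look roughly like this sketch (the class name and config keys are illustrative, not the final FLIP-49 option names):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: the resource manager computes the memory sizes once
// and renders them as dynamic configuration properties appended to the
// TaskManager start command, instead of exporting environment variables.
public class TaskExecutorStartupConfig {

    public static String toDynamicProperties(Map<String, String> derivedSizes) {
        return derivedSizes.entrySet().stream()
            .map(e -> "-D" + e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        Map<String, String> sizes = new LinkedHashMap<>();
        sizes.put("taskmanager.memory.heap.size", "800m");
        sizes.put("taskmanager.memory.direct.size", "200m");
        System.out.println(toDynamicProperties(sizes));
    }
}
```

The advantage over environment variables is that the started process sees exactly the same configuration mechanism as every other option, which makes the derived values visible and debuggable.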

On Thu, Aug 22, 2019 at 2:51 PM Xintong Song <[hidden email]> wrote:

> Hi everyone,
>
> I just updated the FLIP document on wiki [1], with the following changes.
>
>    - Removed open question regarding MemorySegment allocation. As
>    discussed, we exclude this topic from the scope of this FLIP.
>    - Updated content about JVM direct memory parameter according to recent
>    discussions, and moved the other options to "Rejected Alternatives" for
> the
>    moment.
>    - Added implementation steps.
>
>
> Thank you~
>
> Xintong Song
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>
> On Mon, Aug 19, 2019 at 7:16 PM Stephan Ewen <[hidden email]> wrote:
>
> > @Xintong: Concerning "wait for memory users before task dispose and
> memory
> > release": I agree, that's how it should be. Let's try it out.
> >
> > @Xintong @Jingsong: Concerning " JVM does not wait for GC when allocating
> > direct memory buffer": There seems to be pretty elaborate logic to free
> > buffers when allocating new ones. See
> >
> >
> http://hg.openjdk.java.net/jdk8u/jdk8u-dev/jdk/file/tip/src/share/classes/java/nio/Bits.java#l643
> >
> > @Till: Maybe. If we assume that the JVM default works (like going with
> > option 2 and not setting "-XX:MaxDirectMemorySize" at all), then I think
> it
> > should be okay to set "-XX:MaxDirectMemorySize" to
> > "off_heap_managed_memory + direct_memory" even if we use RocksDB. That
> is a
> > big if, though, I honestly have no idea :D Would be good to understand
> > this, though, because this would affect option (2) and option (1.2).
> >
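The Bits.java logic Stephan links above can be caricatured with a small self-contained sketch (an illustration of the idea, not the actual JDK code): direct allocations are counted against a budget, and before failing, the JVM attempts to reclaim already-unreachable buffers and retries.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified imitation of the JDK's direct-memory accounting: reserve
// against a budget, run a reclaim step (standing in for reference
// processing / System.gc()) on failure, and retry once before giving up.
public class DirectBudget {

    private final long maxBytes;
    private final AtomicLong reserved = new AtomicLong();
    private final Runnable reclaim;

    public DirectBudget(long maxBytes, Runnable reclaim) {
        this.maxBytes = maxBytes;
        this.reclaim = reclaim;
    }

    public boolean tryReserve(long bytes) {
        for (int attempt = 0; attempt < 2; attempt++) {
            long cur = reserved.get();
            if (cur + bytes <= maxBytes && reserved.compareAndSet(cur, cur + bytes)) {
                return true;
            }
            reclaim.run(); // give buffer cleaners a chance, then retry
        }
        return false; // the real JDK throws OutOfMemoryError("Direct buffer memory")
    }

    public void release(long bytes) {
        reserved.addAndGet(-bytes);
    }
}
```

Whether this retry logic is reliable enough in practice is exactly the open question in this sub-thread: it only helps if the buffers to be reclaimed are already unreachable when the allocation happens.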
> > On Mon, Aug 19, 2019 at 4:44 PM Xintong Song <[hidden email]>
> > wrote:
> >
> > > Thanks for the inputs, Jingsong.
> > >
> > > Let me try to summarize your points. Please correct me if I'm wrong.
> > >
> > > >    - Memory consumers should always avoid returning memory segments
> > > >    to the memory manager while there are still un-cleaned structures
> > > >    / threads that may use the memory. Otherwise, it would cause
> > > >    serious problems by having multiple consumers trying to use the
> > > >    same memory segment.
> > > >    - The JVM does not wait for GC when allocating a direct memory
> > > >    buffer. Therefore, even if we set a proper max direct memory size
> > > >    limit, we may still encounter a direct memory OOM if the GC cleans
> > > >    memory more slowly than direct memory is allocated.
> > >
> > > Am I understanding this correctly?
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Mon, Aug 19, 2019 at 4:21 PM JingsongLee <[hidden email]
> > > .invalid>
> > > wrote:
> > >
> > > > Hi Stephan,
> > > >
> > > > About option 2:
> > > >
> > > > If additional threads are not cleanly shut down before we exit the
> > > > task: in the current case of memory reuse, the exiting task has
> > > > already freed the memory it uses. If this memory is then used by
> > > > other tasks while asynchronous threads of the exited task may still
> > > > be writing to it, there will be concurrency problems, which can even
> > > > lead to errors in the user's computed results.
> > > >
> > > > So I think this is a serious and intolerable bug; no matter which
> > > > option we choose, it should be avoided.
> > > >
> > > > About direct memory cleaned by GC:
> > > > I don't think relying on GC is a good idea. I've encountered many
> > > > situations where GC came too late, causing a DirectMemory OOM.
> > > > Releasing and allocating DirectMemory depends on the type of the
> > > > user job, which is often beyond our control.
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > > >
> > > > ------------------------------------------------------------------
> > > > From:Stephan Ewen <[hidden email]>
> > > > Send Time:2019年8月19日(星期一) 15:56
> > > > To:dev <[hidden email]>
> > > > Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> > > > TaskExecutors
> > > >
> > > > My main concern with option 2 (manually release memory) is that
> > segfaults
> > > > in the JVM send off all sorts of alarms on user ends. So we need to
> > > > guarantee that this never happens.
> > > >
> > > > The trickiness is in tasks that use data structures / algorithms
> > > > with additional threads, like hash table spill/read and sorting
> > > > threads. We need to ensure that these shut down cleanly before we
> > > > can exit the task.
> > > > I am not sure that we have that guaranteed already, that's why option
> > 1.1
> > > > seemed simpler to me.
> > > >
> > > > On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <[hidden email]>
> > > > wrote:
> > > >
> > > > > Thanks for the comments, Stephan. Summarized in this way really
> makes
> > > > > things easier to understand.
> > > > >
> > > > > I'm in favor of option 2, at least for the moment. I think it is
> > > > > not that difficult to keep the memory manager segfault safe, as
> > > > > long as we always de-allocate a memory segment when it is released
> > > > > by the memory consumers. A segfault can then only occur if a
> > > > > consumer keeps using the buffer of a memory segment after
> > > > > releasing it, in which case we do want the job to fail so that we
> > > > > detect the memory leak early.
> > > > >
> > > > > For option 1.2, I don't think this is a good idea. Not only may
> > > > > the assumption (regular GC is enough to clean direct buffers) not
> > > > > always hold, it also makes it harder to find problems in cases of
> > > > > memory overuse. E.g., the user configures some direct memory for
> > > > > the user libraries. If a library actually uses more direct memory
> > > > > than configured, and those buffers cannot be cleaned by GC because
> > > > > they are still in use, this may lead to overuse of the total
> > > > > container memory. In that case, as long as usage stays below the
> > > > > JVM default max direct memory limit, we get no direct memory OOM,
> > > > > and it becomes very hard to understand which part of the
> > > > > configuration needs to be updated.
> > > > >
> > > > > Option 1.1 has a similar problem to 1.2 if the excess direct
> > > > > memory does not reach the max direct memory limit specified by the
> > > > > dedicated parameter. I think it is slightly better than 1.2, only
> > > > > because we can tune the parameter.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <[hidden email]>
> > wrote:
> > > > >
> > > > > > About the "-XX:MaxDirectMemorySize" discussion, maybe let me
> > > summarize
> > > > > it a
> > > > > > bit differently:
> > > > > >
> > > > > > We have the following two options:
> > > > > >
> > > > > > (1) We let MemorySegments be de-allocated by the GC. That makes
> it
> > > > > segfault
> > > > > > safe. But then we need a way to trigger GC in case de-allocation
> > and
> > > > > > re-allocation of a bunch of segments happens quickly, which is
> > often
> > > > the
> > > > > > case during batch scheduling or task restart.
> > > > > >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do
> > this
> > > > > >   - Another way could be to have a dedicated bookkeeping in the
> > > > > > MemoryManager (option 1.2), so that this is a number independent
> of
> > > the
> > > > > > "-XX:MaxDirectMemorySize" parameter.
> > > > > >
> > > > > > (2) We manually allocate and de-allocate the memory for the
> > > > > MemorySegments
> > > > > > (option 2). That way we need not worry about triggering GC by
> some
> > > > > > threshold or bookkeeping, but it is harder to prevent segfaults.
> We
> > > > need
> > > > > to
> > > > > > be very careful about when we release the memory segments (only
> in
> > > the
> > > > > > cleanup phase of the main thread).
> > > > > >
> > > > > > If we go with option 1.1, we probably need to set
> > > > > > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> > > direct_memory"
> > > > > and
> > > > > > have "direct_memory" as a separate reserved memory pool. Because
> if
> > > we
> > > > > just
> > > > > > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> > > > > jvm_overhead",
> > > > > > then there will be times when that entire memory is allocated by
> > > direct
> > > > > > buffers and we have nothing left for the JVM overhead. So we
> either
> > > > need
> > > > > a
> > > > > > way to compensate for that (again some safety margin cutoff
> value)
> > or
> > > > we
> > > > > > will exceed container memory.
> > > > > >
> > > > > > If we go with option 1.2, we need to be aware that it takes
> > elaborate
> > > > > logic
> > > > > > to push recycling of direct buffers without always triggering a
> > full
> > > > GC.
> > > > > >
> > > > > >
> > > > > > My first guess is that the options will be easiest to do in the
> > > > following
> > > > > > order:
> > > > > >
> > > > > >   - Option 1.1 with a dedicated direct_memory parameter, as
> > discussed
> > > > > > above. We would need to find a way to set the direct_memory
> > parameter
> > > > by
> > > > > > default. We could start with 64 MB and see how it goes in
> practice.
> > > One
> > > > > > danger I see is that setting this too low can cause a bunch of
> > > > additional
> > > > > > GCs compared to before (we need to watch this carefully).
> > > > > >
> > > > > >   - Option 2. It is actually quite simple to implement, we could
> > try
> > > > how
> > > > > > segfault safe we are at the moment.
> > > > > >
> > > > > >   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> > > > > parameter
> > > > > > at all and assume that all the direct memory allocations that the
> > JVM
> > > > and
> > > > > > Netty do are infrequent enough to be cleaned up fast enough
> through
> > > > > regular
> > > > > > GC. I am not sure if that is a valid assumption, though.
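Option (2) above can be illustrated with a minimal sketch: the key invariant is that all helper threads (spill/sort threads) are joined before the segment is released back. The `ByteBuffer` and `Thread` usage here is a stand-in for illustration, not Flink's actual MemorySegment / Unsafe-based implementation.

```java
// Illustrative sketch of option (2): memory is only "released" in the
// cleanup phase of the task's main thread, after all helper threads have
// been joined. All names here are hypothetical.
import java.nio.ByteBuffer;

public class ManualReleaseSketch {
    public static void main(String[] args) throws InterruptedException {
        ByteBuffer segment = ByteBuffer.allocateDirect(1024); // "allocate"

        // A stand-in for a spill/sort helper thread writing into the segment.
        Thread spillThread = new Thread(() -> segment.putInt(0, 42));
        spillThread.start();

        // Crucial ordering: join helper threads BEFORE releasing memory.
        // Otherwise a still-running thread could write into a segment that
        // has already been handed to another consumer (the bug Jingsong
        // describes above).
        spillThread.join();

        int value = segment.getInt(0); // safe: the writer has finished
        System.out.println(value);
        // Only now would real code de-allocate / return the segment.
    }
}
```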
> > > > > >
> > > > > > Best,
> > > > > > Stephan
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <
> > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for sharing your opinion Till.
> > > > > > >
> > > > > > > I'm also in favor of alternative 2. I was wondering whether we
> > can
> > > > > avoid
> > > > > > > using Unsafe.allocate() for off-heap managed memory and network
> > > > memory
> > > > > > with
> > > > > > > alternative 3. But after giving it a second thought, I think
> even
> > > for
> > > > > > > alternative 3 using direct memory for off-heap managed memory
> > could
> > > > > cause
> > > > > > > problems.
> > > > > > >
> > > > > > > Hi Yang,
> > > > > > >
> > > > > > > Regarding your concern, I think what is proposed in this FLIP
> > > > > > > is to have both off-heap managed memory and network memory
> > > > > > > allocated through Unsafe.allocate(), which means they are
> > > > > > > practically native memory and not limited by the JVM max
> > > > > > > direct memory. The only parts of memory limited by the JVM max
> > > > > > > direct memory are task off-heap memory and JVM overhead, which
> > > > > > > is exactly what alternative 2 suggests to set the JVM max
> > > > > > > direct memory to.
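The distinction drawn here can be demonstrated: `sun.misc.Unsafe.allocateMemory()` returns native memory that is not accounted against `-XX:MaxDirectMemorySize`, unlike `ByteBuffer.allocateDirect()`. A hedged sketch (accessing `Unsafe` via reflection is unsupported API but works on common JDKs; this is a demonstration, not Flink code):

```java
// Unsafe.allocateMemory() bypasses the JVM's direct-memory accounting,
// so memory allocated this way is effectively native memory. It must be
// freed manually; GC will not reclaim it.
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeAllocationSketch {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long address = unsafe.allocateMemory(1024); // no MaxDirectMemorySize check
        unsafe.putLong(address, 42L);
        long value = unsafe.getLong(address);
        unsafe.freeMemory(address); // manual release, as the FLIP discussion assumes
        System.out.println(value);
    }
}
```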
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <
> > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the clarification Xintong. I understand the two
> > > > > alternatives
> > > > > > > > now.
> > > > > > > >
> > > > > > > > I would be in favour of option 2 because it makes things
> > > explicit.
> > > > If
> > > > > > we
> > > > > > > > don't limit the direct memory, I fear that we might end up
> in a
> > > > > similar
> > > > > > > > situation as we are currently in: The user might see that her
> > > > process
> > > > > > > gets
> > > > > > > > killed by the OS and does not know why this is the case.
> > > > > Consequently,
> > > > > > > she
> > > > > > > > tries to decrease the process memory size (similar to
> > increasing
> > > > the
> > > > > > > cutoff
> > > > > > > > ratio) in order to accommodate for the extra direct memory.
> > Even
> > > > > worse,
> > > > > > > she
> > > > > > > > tries to decrease memory budgets which are not fully used and
> > > hence
> > > > > > won't
> > > > > > > > change the overall memory consumption.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Let me explain this with a concrete example Till.
> > > > > > > > >
> > > > > > > > > Let's say we have the following scenario.
> > > > > > > > >
> > > > > > > > > Total Process Memory: 1GB
> > > > > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead):
> > 200MB
> > > > > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap
> > Managed
> > > > > Memory
> > > > > > > and
> > > > > > > > > Network Memory): 800MB
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very
> > > large
> > > > > > > value,
> > > > > > > > > let's say 1TB.
> > > > > > > > >
> > > > > > > > > If the actual direct memory usage of Task Off-Heap Memory
> > > > > > > > > and JVM Overhead does not exceed 200MB, then alternative 2
> > > > > > > > > and alternative 3 should have the same utility. Setting a
> > > > > > > > > larger -XX:MaxDirectMemorySize will not reduce the sizes
> > > > > > > > > of the other memory pools.
> > > > > > > > >
> > > > > > > > > If the actual direct memory usage of Task Off-Heap Memory
> and
> > > JVM
> > > > > > > > > Overhead potentially exceed 200MB, then
> > > > > > > > >
> > > > > > > > >    - Alternative 2 suffers from frequent OOMs. To avoid
> > > > > > > > >    that, the only thing the user can do is modify the
> > > > > > > > >    configuration and increase JVM Direct Memory (Task
> > > > > > > > >    Off-Heap Memory + JVM Overhead). Say the user increases
> > > > > > > > >    JVM Direct Memory to 250MB; this reduces the total size
> > > > > > > > >    of the other memory pools to 750MB, given that the
> > > > > > > > >    total process memory remains 1GB.
> > > > > > > > >    - For alternative 3, there is no chance of direct OOM.
> > There
> > > > are
> > > > > > > > chances
> > > > > > > > >    of exceeding the total process memory limit, but given
> > that
> > > > the
> > > > > > > > process
> > > > > > > > > may
> > > > > > > > >    not use up all the reserved native memory (Off-Heap
> > Managed
> > > > > > Memory,
> > > > > > > > > Network
> > > > > > > > >    Memory, JVM Metaspace), if the actual direct memory
> > > > > > > > >    usage is slightly above yet very close to 200MB, the
> > > > > > > > >    user probably does not need to change the
> > > > > > > > >    configurations.
> > > > > > > > >
> > > > > > > > > Therefore, I think from the user's perspective, a feasible
> > > > > > > configuration
> > > > > > > > > for alternative 2 may lead to lower resource utilization
> > > compared
> > > > > to
> > > > > > > > > alternative 3.
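The arithmetic in the example above can be sketched as follows. This is not Flink's actual configuration code; it treats 1 GB as 1000 MB for round numbers, as the example does, and the 150/50 split between Task Off-Heap Memory and JVM Overhead is an assumption for illustration.

```java
// Sketch of the memory split from Xintong's concrete example.
public class MemorySplitExample {
    public static void main(String[] args) {
        long totalProcessMb = 1000;  // Total Process Memory ("1GB")
        long taskOffHeapMb = 150;    // Task Off-Heap Memory (assumed split)
        long jvmOverheadMb = 50;     // JVM Overhead (assumed split)

        // Alternative 2: -XX:MaxDirectMemorySize covers exactly these pools.
        long maxDirectMb = taskOffHeapMb + jvmOverheadMb;

        // All remaining memory (JVM heap, metaspace, off-heap managed
        // memory, network memory) shares the rest of the process budget.
        long otherMb = totalProcessMb - maxDirectMb;

        // Alternative 3 would instead set -XX:MaxDirectMemorySize to a
        // huge value (e.g. 1TB) while leaving the pools above unchanged.
        System.out.println(maxDirectMb + " " + otherMb);
    }
}
```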
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> > > > > [hidden email]
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > I guess you have to help me understand the difference
> > between
> > > > > > > > > alternative 2
> > > > > > > > > > and 3 wrt to memory under utilization Xintong.
> > > > > > > > > >
> > > > > > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task
> > Off-Heap
> > > > > Memory
> > > > > > > and
> > > > > > > > > JVM
> > > > > > > > > > Overhead. Then there is the risk that this size is too
> low
> > > > > > resulting
> > > > > > > > in a
> > > > > > > > > > lot of garbage collection and potentially an OOM.
> > > > > > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something
> > > larger
> > > > > > than
> > > > > > > > > > alternative 2. This would of course reduce the sizes of
> the
> > > > other
> > > > > > > > memory
> > > > > > > > > > types.
> > > > > > > > > >
> > > > > > > > > > How would alternative 2 now result in an under
> utilization
> > of
> > > > > > memory
> > > > > > > > > > compared to alternative 3? If alternative 3 strictly
> sets a
> > > > > higher
> > > > > > > max
> > > > > > > > > > direct memory size and we use only little, then I would
> > > expect
> > > > > that
> > > > > > > > > > alternative 3 results in memory under utilization.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Till
> > > > > > > > > >
> > > > > > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi xintong,till
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Native and Direct Memory
> > > > > > > > > > >
> > > > > > > > > > > My point is to set a very large max direct memory size
> > > > > > > > > > > when we do not differentiate direct and native memory.
> > > > > > > > > > > If the direct memory, including user direct memory and
> > > > > > > > > > > framework direct memory, could be calculated correctly,
> > > > > > > > > > > then I am in favor of setting direct memory to a fixed
> > > > > > > > > > > value.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Memory Calculation
> > > > > > > > > > >
> > > > > > > > > > > I agree with Xintong. For Yarn and K8s, we need to
> > > > > > > > > > > check the memory configurations on the client side to
> > > > > > > > > > > avoid submitting successfully and then failing in the
> > > > > > > > > > > Flink master.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > >
> > > > > > > > > > > Yang
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song <[hidden email]>于2019年8月13日
> > 周二22:07写道:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for replying, Till.
> > > > > > > > > > > >
> > > > > > > > > > > > About MemorySegment, I think you are right that we
> > should
> > > > not
> > > > > > > > include
> > > > > > > > > > > this
> > > > > > > > > > > > issue in the scope of this FLIP. This FLIP should
> > > > concentrate
> > > > > > on
> > > > > > > > how
> > > > > > > > > to
> > > > > > > > > > > > configure memory pools for TaskExecutors, with
> minimum
> > > > > > > involvement
> > > > > > > > on
> > > > > > > > > > how
> > > > > > > > > > > > memory consumers use it.
> > > > > > > > > > > >
> > > > > > > > > > > > About direct memory, I think alternative 3 may not
> > > > > > > > > > > > have the same over-reservation issue that alternative
> > > > > > > > > > > > 2 does, but at the cost of a risk of overusing memory
> > > > > > > > > > > > at the container level, which is not good. My point
> > > > > > > > > > > > is that both "Task Off-Heap Memory" and "JVM
> > > > > > > > > > > > Overhead" are not easy to configure. For alternative
> > > > > > > > > > > > 2, users might configure them higher than what is
> > > > > > > > > > > > actually needed, just to avoid getting a direct OOM.
> > > > > > > > > > > > For alternative 3, users do not get a direct OOM, so
> > > > > > > > > > > > they may not configure the two options aggressively
> > > > > > > > > > > > high. But the consequence is a risk that overall
> > > > > > > > > > > > container memory usage exceeds the budget.
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you~
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > > > > > [hidden email]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > > > > > >
> > > > > > > > > > > > > All in all I think it already looks quite good.
> > > > Concerning
> > > > > > the
> > > > > > > > > first
> > > > > > > > > > > open
> > > > > > > > > > > > > question about allocating memory segments, I was
> > > > wondering
> > > > > > > > whether
> > > > > > > > > > this
> > > > > > > > > > > > is
> > > > > > > > > > > > > strictly necessary to do in the context of this
> FLIP
> > or
> > > > > > whether
> > > > > > > > > this
> > > > > > > > > > > > could
> > > > > > > > > > > > > be done as a follow up? Without knowing all
> details,
> > I
> > > > > would
> > > > > > be
> > > > > > > > > > > concerned
> > > > > > > > > > > > > that we would widen the scope of this FLIP too much
> > > > because
> > > > > > we
> > > > > > > > > would
> > > > > > > > > > > have
> > > > > > > > > > > > > to touch all the existing call sites of the
> > > MemoryManager
> > > > > > where
> > > > > > > > we
> > > > > > > > > > > > allocate
> > > > > > > > > > > > > memory segments (this should mainly be batch
> > > operators).
> > > > > The
> > > > > > > > > addition
> > > > > > > > > > > of
> > > > > > > > > > > > > the memory reservation call to the MemoryManager
> > should
> > > > not
> > > > > > be
> > > > > > > > > > affected
> > > > > > > > > > > > by
> > > > > > > > > > > > > this and I would hope that this is the only point
> of
> > > > > > > interaction
> > > > > > > > a
> > > > > > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Concerning the second open question about setting
> or
> > > not
> > > > > > > setting
> > > > > > > > a
> > > > > > > > > > max
> > > > > > > > > > > > > direct memory limit, I would also be interested why
> > > Yang
> > > > > Wang
> > > > > > > > > thinks
> > > > > > > > > > > > > leaving it open would be best. My concern about
> this
> > > > would
> > > > > be
> > > > > > > > that
> > > > > > > > > we
> > > > > > > > > > > > would
> > > > > > > > > > > > > be in a similar situation as we are now with the
> > > > > > > > > RocksDBStateBackend.
> > > > > > > > > > > If
> > > > > > > > > > > > > the different memory pools are not clearly
> separated
> > > and
> > > > > can
> > > > > > > > spill
> > > > > > > > > > over
> > > > > > > > > > > > to
> > > > > > > > > > > > > a different pool, then it is quite hard to
> understand
> > > > what
> > > > > > > > exactly
> > > > > > > > > > > > causes a
> > > > > > > > > > > > > process to get killed for using too much memory.
> This
> > > > could
> > > > > > > then
> > > > > > > > > > easily
> > > > > > > > > > > > > lead to a similar situation what we have with the
> > > > > > cutoff-ratio.
> > > > > > > > So
> > > > > > > > > > why
> > > > > > > > > > > > not
> > > > > > > > > > > > > setting a sane default value for max direct memory
> > and
> > > > > giving
> > > > > > > the
> > > > > > > > > > user
> > > > > > > > > > > an
> > > > > > > > > > > > > option to increase it if he runs into an OOM.
> > > > > > > > > > > > >
> > > > > > > > > > > > > @Xintong, how would alternative 2 lead to lower
> > memory
> > > > > > > > utilization
> > > > > > > > > > than
> > > > > > > > > > > > > alternative 3 where we set the direct memory to a
> > > higher
> > > > > > value?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Till
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > > > > > [hidden email]
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regarding your comments:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > > > > > I think setting a very large max direct memory
> size
> > > > > > > definitely
> > > > > > > > > has
> > > > > > > > > > > some
> > > > > > > > > > > > > > good sides. E.g., we do not worry about direct
> OOM,
> > > and
> > > > > we
> > > > > > > > don't
> > > > > > > > > > even
> > > > > > > > > > > > > need
> > > > > > > > > > > > > > to allocate managed / network memory with
> > > > > > Unsafe.allocate() .
> > > > > > > > > > > > > > However, there are also some down sides of doing
> > > this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    - One thing I can think of is that if a task
> > > > > > > > > > > > > >    executor container is killed due to overusing
> > > > > > > > > > > > > >    memory, it could be hard for us to know which
> > > > > > > > > > > > > >    part of the memory is overused.
> > > > > > > > > > > > > >    - Another downside is that the JVM never
> > > > > > > > > > > > > >    triggers GC due to reaching the max direct
> > > > > > > > > > > > > >    memory limit, because the limit is too high
> > > > > > > > > > > > > >    to be reached. That means we kind of rely on
> > > > > > > > > > > > > >    heap memory activity to trigger GC and
> > > > > > > > > > > > > >    release direct memory. That could be a
> > > > > > > > > > > > > >    problem in cases where we have more direct
> > > > > > > > > > > > > >    memory usage but not enough heap activity to
> > > > > > > > > > > > > >    trigger the GC.
> > > > > > > > > > > > > >
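The GC dependence described in the bullets above can be demonstrated: the native memory behind a direct `ByteBuffer` is only reclaimed when the buffer object itself is garbage-collected. With a huge `-XX:MaxDirectMemorySize`, the limit never forces collection, so idle direct memory can pile up when there is little heap activity. A minimal demonstration (not Flink code; the explicit `System.gc()` stands in for the heap activity the thread discusses):

```java
// Direct buffers become unreachable immediately, but their native memory
// is only released once a GC cycle runs the buffer's Cleaner.
import java.nio.ByteBuffer;

public class DirectGcSketch {
    public static void main(String[] args) {
        for (int i = 0; i < 8; i++) {
            ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MB
            buf = null; // unreachable, but native memory is NOT freed yet
            // Without enough heap activity, nothing triggers collection;
            // here we trigger it explicitly for illustration.
            System.gc();
        }
        System.out.println("done");
    }
}
```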
> > > > > > > > > > > > > > Maybe you can share your reasons for preferring
> > > > > > > > > > > > > > to set a very large value, in case there is
> > > > > > > > > > > > > > anything else I overlooked.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Memory Calculation*
> > > > > > > > > > > > > > If there is any conflict between multiple
> > > > > > > > > > > > > > configurations that the user explicitly
> > > > > > > > > > > > > > specified, I think we should throw an error.
> > > > > > > > > > > > > > I think doing checking on the client side is a
> good
> > > > idea,
> > > > > > so
> > > > > > > > that
> > > > > > > > > > on
> > > > > > > > > > > > > Yarn /
> > > > > > > > > > > > > > K8s we can discover the problem before submitting
> > the
> > > > > Flink
> > > > > > > > > > cluster,
> > > > > > > > > > > > > which
> > > > > > > > > > > > > > is always a good thing.
> > > > > > > > > > > > > > But we cannot rely only on client-side checking,
> > > > > > > > > > > > > > because for a standalone cluster, TaskManagers on
> > > > > > > > > > > > > > different machines may have different
> > > > > > > > > > > > > > configurations, and the client does not see
> > > > > > > > > > > > > > those.
> > > > > > > > > > > > > > What do you think?
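The kind of configuration check discussed here can be sketched as a fail-fast validation: if explicitly configured fine-grained pools exceed the total process memory, throw an error at submission time instead of failing later in the Flink master. The class name, pool names, and exact rule below are assumptions for illustration, not Flink's actual implementation.

```java
// Hypothetical client-side sanity check for conflicting memory options.
public class MemoryConfigCheck {
    static void validate(long totalProcessMb, long networkMb, long managedMb,
                         long taskHeapMb, long taskOffHeapMb) {
        long sum = networkMb + managedMb + taskHeapMb + taskOffHeapMb;
        if (sum > totalProcessMb) {
            throw new IllegalArgumentException(
                "Configured memory pools (" + sum + " MB) exceed total process memory ("
                + totalProcessMb + " MB)");
        }
    }

    public static void main(String[] args) {
        validate(1024, 128, 512, 256, 64);     // consistent: passes
        boolean rejected = false;
        try {
            validate(1024, 512, 512, 512, 64); // conflicting: should throw
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        System.out.println(rejected);
    }
}
```

The same check would run again on each TaskManager for standalone setups, since (as noted above) the client cannot see per-machine configurations.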
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > > > > > [hidden email]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi xintong,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for your detailed proposal. After all
> the
> > > > memory
> > > > > > > > > > > configuration
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > > introduced, it will be more powerful to control
> > the
> > > > > flink
> > > > > > > > > memory
> > > > > > > > > > > > > usage. I
> > > > > > > > > > > > > > > just have few questions about it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We do not differentiate user direct memory and
> > > > > > > > > > > > > > > native memory. They are all included in task
> > > > > > > > > > > > > > > off-heap memory, right? So I don't think we
> > > > > > > > > > > > > > > could set -XX:MaxDirectMemorySize properly. I
> > > > > > > > > > > > > > > prefer leaving it a very large value.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If the sum of the fine-grained memory pools
> > > > > > > > > > > > > > > (network memory, managed memory, etc.) is
> > > > > > > > > > > > > > > larger than the total process memory, how do we
> > > > > > > > > > > > > > > deal with this situation? Do we need to check
> > > > > > > > > > > > > > > the memory configuration in the client?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Xintong Song <[hidden email]>
> > 于2019年8月7日周三
> > > > > > > 下午10:14写道:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We would like to start a discussion thread on
> > > > > "FLIP-49:
> > > > > > > > > Unified
> > > > > > > > > > > > > Memory
> > > > > > > > > > > > > > > > Configuration for TaskExecutors"[1], where we
> > > > > describe
> > > > > > > how
> > > > > > > > to
> > > > > > > > > > > > improve
> > > > > > > > > > > > > > > > TaskExecutor memory configurations. The FLIP
> > > > document
> > > > > > is
> > > > > > > > > mostly
> > > > > > > > > > > > based
> > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > early design "Memory Management and
> > Configuration
> > > > > > > > > Reloaded"[2]
> > > > > > > > > > by
> > > > > > > > > > > > > > > Stephan,
> > > > > > > > > > > > > > > > with updates from follow-up discussions both
> > > online
> > > > > and
> > > > > > > > > > offline.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This FLIP addresses several shortcomings of
> > > current
> > > > > > > (Flink
> > > > > > > > > 1.9)
> > > > > > > > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >    - Different configuration for Streaming
> and
> > > > Batch.
> > > > > > > > > > > > > > > >    - Complex and difficult configuration of
> > > RocksDB
> > > > > in
> > > > > > > > > > Streaming.
> > > > > > > > > > > > > > > >    - Complicated, uncertain and hard to
> > > understand.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Key changes to solve the problems can be
> > > summarized
> > > > > as
> > > > > > > > > follows.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >    - Extend memory manager to also account
> for
> > > > memory
> > > > > > > usage
> > > > > > > > > by
> > > > > > > > > > > > state
> > > > > > > > > > > > > > > >    backends.
> > > > > > > > > > > > > > > >    - Modify how TaskExecutor memory is
> > > > > > > > > > > > > > > >    partitioned into individual memory
> > > > > > > > > > > > > > > >    reservations and pools.
> > > > > > > > > > > > > > > >    - Simplify memory configuration options
> > > > > > > > > > > > > > > >    and calculation logic.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Please find more details in the FLIP wiki
> > > document
> > > > > [1].
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > (Please note that the early design doc [2] is
> > out
> > > > of
> > > > > > > sync,
> > > > > > > > > and
> > > > > > > > > > it
> > > > > > > > > > > > is
> > > > > > > > > > > > > > > > appreciated to have the discussion in this
> > > mailing
> > > > > list
> > > > > > > > > > thread.)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [2]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <
> > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for sharing your opinion Till.
> > > > > > >
> > > > > > > I'm also in favor of alternative 2. I was wondering whether we
> > can
> > > > > avoid
> > > > > > > using Unsafe.allocate() for off-heap managed memory and network
> > > > memory
> > > > > > with
> > > > > > > alternative 3. But after giving it a second thought, I think
> even
> > > for
> > > > > > > alternative 3 using direct memory for off-heap managed memory
> > could
> > > > > cause
> > > > > > > problems.
> > > > > > >
> > > > > > > Hi Yang,
> > > > > > >
> > > > > > > Regarding your concern, I think what proposed in this FLIP it
> to
> > > have
> > > > > > both
> > > > > > > off-heap managed memory and network memory allocated through
> > > > > > > Unsafe.allocate(), which means they are practically native
> memory
> > > and
> > > > > not
> > > > > > > limited by JVM max direct memory. The only parts of memory
> > limited
> > > by
> > > > > JVM
> > > > > > > max direct memory are task off-heap memory and JVM overhead,
> > > > > > > which is exactly what alternative 2 suggests setting the JVM max
> > > > > > > direct memory to.
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <
> > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the clarification Xintong. I understand the two
> > > > > alternatives
> > > > > > > > now.
> > > > > > > >
> > > > > > > > I would be in favour of option 2 because it makes things
> > > explicit.
> > > > If
> > > > > > we
> > > > > > > > don't limit the direct memory, I fear that we might end up
> in a
> > > > > similar
> > > > > > > > situation as we are currently in: The user might see that her
> > > > process
> > > > > > > gets
> > > > > > > > killed by the OS and does not know why this is the case.
> > > > > Consequently,
> > > > > > > she
> > > > > > > > tries to decrease the process memory size (similar to
> > increasing
> > > > the
> > > > > > > cutoff
> > > > > > > > ratio) in order to accommodate for the extra direct memory.
> > Even
> > > > > worse,
> > > > > > > she
> > > > > > > > tries to decrease memory budgets which are not fully used and
> > > hence
> > > > > > won't
> > > > > > > > change the overall memory consumption.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Let me explain this with a concrete example Till.
> > > > > > > > >
> > > > > > > > > Let's say we have the following scenario.
> > > > > > > > >
> > > > > > > > > Total Process Memory: 1GB
> > > > > > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead):
> > 200MB
> > > > > > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap
> > Managed
> > > > > Memory
> > > > > > > and
> > > > > > > > > Network Memory): 800MB
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very
> > > large
> > > > > > > value,
> > > > > > > > > let's say 1TB.
> > > > > > > > >
> > > > > > > > > If the actual direct memory usage of Task Off-Heap Memory
> and
> > > JVM
> > > > > > > > Overhead
> > > > > > > > > do not exceed 200MB, then alternative 2 and alternative 3
> > > should
> > > > > have
> > > > > > > the
> > > > > > > > > same utility. Setting larger -XX:MaxDirectMemorySize will
> not
> > > > > reduce
> > > > > > > the
> > > > > > > > > sizes of the other memory pools.
> > > > > > > > >
> > > > > > > > > If the actual direct memory usage of Task Off-Heap Memory
> and
> > > JVM
> > > > > > > > > Overhead potentially exceed 200MB, then
> > > > > > > > >
> > > > > > > > >    - Alternative 2 suffers from frequent OOM. To avoid
> that,
> > > the
> > > > > only
> > > > > > > > thing
> > > > > > > > >    user can do is to modify the configuration and increase
> > JVM
> > > > > Direct
> > > > > > > > > Memory
> > > > > > > > >    (Task Off-Heap Memory + JVM Overhead). Let's say that
> user
> > > > > > increases
> > > > > > > > JVM
> > > > > > > > >    Direct Memory to 250MB, this will reduce the total size
> of
> > > > other
> > > > > > > > memory
> > > > > > > > >    pools to 750MB, given the total process memory remains
> > 1GB.
> > > > > > > > >    - For alternative 3, there is no chance of direct OOM.
> > There
> > > > are
> > > > > > > > chances
> > > > > > > > >    of exceeding the total process memory limit, but given
> > that
> > > > the
> > > > > > > > process
> > > > > > > > > may
> > > > > > > > >    not use up all the reserved native memory (Off-Heap
> > Managed
> > > > > > Memory,
> > > > > > > > > Network
> > > > > > > > >    Memory, JVM Metaspace), if the actual direct memory
> usage
> > is
> > > > > > > slightly
> > > > > > > > > above
> > > > > > > > >    yet very close to 200MB, users probably do not need to
> > > > > > > > >    change the configurations.
> > > > > > > > >
> > > > > > > > > Therefore, I think from the user's perspective, a feasible
> > > > > > > configuration
> > > > > > > > > for alternative 2 may lead to lower resource utilization
> > > compared
> > > > > to
> > > > > > > > > alternative 3.
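The budget arithmetic in the concrete example above can be sketched as follows (all names are hypothetical illustrations for this thread, not Flink APIs):

```java
public class MemoryBudgetExample {
    // Remaining budget for heap, metaspace, managed and network memory
    // once the JVM direct memory pool is carved out of the fixed total.
    static long otherPoolsMb(long totalProcessMb, long jvmDirectMb) {
        return totalProcessMb - jvmDirectMb;
    }

    public static void main(String[] args) {
        long total = 1000; // ~1 GB total process memory, as in the example

        // Alternative 2, initial configuration: 200 MB JVM direct memory.
        System.out.println(otherPoolsMb(total, 200)); // prints 800

        // Alternative 2 after increasing direct memory to avoid OOMs:
        // the other pools shrink accordingly, since the total is fixed.
        System.out.println(otherPoolsMb(total, 250)); // prints 750
    }
}
```

This illustrates why, under alternative 2, over-provisioning the direct memory budget directly reduces what is left for the other pools.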
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> > > > > [hidden email]
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > I guess you have to help me understand the difference
> > between
> > > > > > > > > alternative 2
> > > > > > > > > > and 3 wrt to memory under utilization Xintong.
> > > > > > > > > >
> > > > > > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task
> > Off-Heap
> > > > > Memory
> > > > > > > and
> > > > > > > > > JVM
> > > > > > > > > > Overhead. Then there is the risk that this size is too
> low
> > > > > > resulting
> > > > > > > > in a
> > > > > > > > > > lot of garbage collection and potentially an OOM.
> > > > > > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something
> > > larger
> > > > > > than
> > > > > > > > > > alternative 2. This would of course reduce the sizes of
> the
> > > > other
> > > > > > > > memory
> > > > > > > > > > types.
> > > > > > > > > >
> > > > > > > > > > How would alternative 2 now result in an under
> utilization
> > of
> > > > > > memory
> > > > > > > > > > compared to alternative 3? If alternative 3 strictly
> sets a
> > > > > higher
> > > > > > > max
> > > > > > > > > > direct memory size and we use only little, then I would
> > > expect
> > > > > that
> > > > > > > > > > alternative 3 results in memory under utilization.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Till
> > > > > > > > > >
> > > > > > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi xintong,till
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Native and Direct Memory
> > > > > > > > > > >
> > > > > > > > > > > My point is setting a very large max direct memory size
> > > when
> > > > we
> > > > > > do
> > > > > > > > not
> > > > > > > > > > > differentiate direct and native memory. If the direct
> > > > > > > > memory,including
> > > > > > > > > > user
> > > > > > > > > > > direct memory and framework direct memory,could be
> > > calculated
> > > > > > > > > > > correctly,then
> > > > > > > > > > > i am in favor of setting direct memory with fixed
> value.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Memory Calculation
> > > > > > > > > > >
> > > > > > > > > > > I agree with xintong. For Yarn and k8s,we need to check
> > the
> > > > > > memory
> > > > > > > > > > > configurations in client to avoid submitting
> successfully
> > > and
> > > > > > > failing
> > > > > > > > > in
> > > > > > > > > > > the flink master.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > >
> > > > > > > > > > > Yang
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song <[hidden email]> wrote on Tue, Aug 13,
> > > > > > > > > > > 2019 at 22:07:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for replying, Till.
> > > > > > > > > > > >
> > > > > > > > > > > > About MemorySegment, I think you are right that we
> > should
> > > > not
> > > > > > > > include
> > > > > > > > > > > this
> > > > > > > > > > > > issue in the scope of this FLIP. This FLIP should
> > > > concentrate
> > > > > > on
> > > > > > > > how
> > > > > > > > > to
> > > > > > > > > > > > configure memory pools for TaskExecutors, with
> minimum
> > > > > > > involvement
> > > > > > > > on
> > > > > > > > > > how
> > > > > > > > > > > > memory consumers use it.
> > > > > > > > > > > >
> > > > > > > > > > > > About direct memory, I think alternative 3 may not have
> > > > > > > > > > > > the same over-reservation issue that alternative 2 does,
> > > > > > > > > > > > but at the cost of risking memory overuse at the
> > > > > > > > > > > > container level, which is not good.
> > > My
> > > > > > point
> > > > > > > is
> > > > > > > > > > that
> > > > > > > > > > > > both "Task Off-Heap Memory" and "JVM Overhead" are
> not
> > > easy
> > > > > to
> > > > > > > > > config.
> > > > > > > > > > > For
> > > > > > > > > > > > alternative 2, users might configure them higher than
> > > what
> > > > > > > actually
> > > > > > > > > > > needed,
> > > > > > > > > > > > just to avoid getting a direct OOM. For alternative
> 3,
> > > > users
> > > > > do
> > > > > > > not
> > > > > > > > > get
> > > > > > > > > > > > direct OOM, so they may not config the two options
> > > > > aggressively
> > > > > > > > high.
> > > > > > > > > > But
> > > > > > > > > > > > the consequences are risks of overall container
> memory
> > > > usage
> > > > > > > > exceeds
> > > > > > > > > > the
> > > > > > > > > > > > budget.
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you~
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > > > > > [hidden email]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > > > > > >
> > > > > > > > > > > > > All in all I think it already looks quite good.
> > > > Concerning
> > > > > > the
> > > > > > > > > first
> > > > > > > > > > > open
> > > > > > > > > > > > > question about allocating memory segments, I was
> > > > wondering
> > > > > > > > whether
> > > > > > > > > > this
> > > > > > > > > > > > is
> > > > > > > > > > > > > strictly necessary to do in the context of this
> FLIP
> > or
> > > > > > whether
> > > > > > > > > this
> > > > > > > > > > > > could
> > > > > > > > > > > > > be done as a follow up? Without knowing all
> details,
> > I
> > > > > would
> > > > > > be
> > > > > > > > > > > concerned
> > > > > > > > > > > > > that we would widen the scope of this FLIP too much
> > > > because
> > > > > > we
> > > > > > > > > would
> > > > > > > > > > > have
> > > > > > > > > > > > > to touch all the existing call sites of the
> > > MemoryManager
> > > > > > where
> > > > > > > > we
> > > > > > > > > > > > allocate
> > > > > > > > > > > > > memory segments (this should mainly be batch
> > > operators).
> > > > > The
> > > > > > > > > addition
> > > > > > > > > > > of
> > > > > > > > > > > > > the memory reservation call to the MemoryManager
> > should
> > > > not
> > > > > > be
> > > > > > > > > > affected
> > > > > > > > > > > > by
> > > > > > > > > > > > > this and I would hope that this is the only point
> of
> > > > > > > interaction
> > > > > > > > a
> > > > > > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Concerning the second open question about setting
> or
> > > not
> > > > > > > setting
> > > > > > > > a
> > > > > > > > > > max
> > > > > > > > > > > > > direct memory limit, I would also be interested why
> > > Yang
> > > > > Wang
> > > > > > > > > thinks
> > > > > > > > > > > > > leaving it open would be best. My concern about
> this
> > > > would
> > > > > be
> > > > > > > > that
> > > > > > > > > we
> > > > > > > > > > > > would
> > > > > > > > > > > > > be in a similar situation as we are now with the
> > > > > > > > > RocksDBStateBackend.
> > > > > > > > > > > If
> > > > > > > > > > > > > the different memory pools are not clearly
> separated
> > > and
> > > > > can
> > > > > > > > spill
> > > > > > > > > > over
> > > > > > > > > > > > to
> > > > > > > > > > > > > a different pool, then it is quite hard to
> understand
> > > > what
> > > > > > > > exactly
> > > > > > > > > > > > causes a
> > > > > > > > > > > > > process to get killed for using too much memory.
> This
> > > > could
> > > > > > > then
> > > > > > > > > > easily
> > > > > > > > > > > > > lead to a similar situation what we have with the
> > > > > > cutoff-ratio.
> > > > > > > > So
> > > > > > > > > > why
> > > > > > > > > > > > not
> > > > > > > > > > > > > setting a sane default value for max direct memory
> > and
> > > > > giving
> > > > > > > the
> > > > > > > > > > user
> > > > > > > > > > > an
> > > > > > > > > > > > > option to increase it if he runs into an OOM.
> > > > > > > > > > > > >
> > > > > > > > > > > > > @Xintong, how would alternative 2 lead to lower
> > memory
> > > > > > > > utilization
> > > > > > > > > > than
> > > > > > > > > > > > > alternative 3 where we set the direct memory to a
> > > higher
> > > > > > value?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Till
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > > > > > [hidden email]
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regarding your comments:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > > > > > I think setting a very large max direct memory
> size
> > > > > > > definitely
> > > > > > > > > has
> > > > > > > > > > > some
> > > > > > > > > > > > > > good sides. E.g., we do not worry about direct
> OOM,
> > > and
> > > > > we
> > > > > > > > don't
> > > > > > > > > > even
> > > > > > > > > > > > > need
> > > > > > > > > > > > > > to allocate managed / network memory with
> > > > > > Unsafe.allocate() .
> > > > > > > > > > > > > > However, there are also some down sides of doing
> > > this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    - One thing I can think of is that if a task
> > > > executor
> > > > > > > > > container
> > > > > > > > > > is
> > > > > > > > > > > > > >    killed due to overusing memory, it could be hard
> > > > > > > > > > > > > >    for us to know which part of the memory is
> > > > > > > > > > > > > >    overused.
> > > > > > > > > > > > > >    - Another down side is that the JVM never
> > trigger
> > > GC
> > > > > due
> > > > > > > to
> > > > > > > > > > > reaching
> > > > > > > > > > > > > max
> > > > > > > > > > > > > >    direct memory limit, because the limit is too
> > high
> > > > to
> > > > > be
> > > > > > > > > > reached.
> > > > > > > > > > > > That
> > > > > > > > > > > > > >    means we kind of rely on heap memory to
> trigger
> > > GC
> > > > > and
> > > > > > > > > release
> > > > > > > > > > > > direct
> > > > > > > > > > > > > >    memory. That could be a problem in cases where
> > we
> > > > have
> > > > > > > more
> > > > > > > > > > direct
> > > > > > > > > > > > > > memory
> > > > > > > > > > > > > >    usage but not enough heap activity to trigger
> > the
> > > > GC.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Maybe you can share your reasons for preferring
> > > > setting a
> > > > > > > very
> > > > > > > > > > large
> > > > > > > > > > > > > value,
> > > > > > > > > > > > > > if there are anything else I overlooked.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Memory Calculation*
> > > > > > > > > > > > > > If there is any conflict between multiple
> > > configuration
> > > > > > that
> > > > > > > > user
> > > > > > > > > > > > > > explicitly specified, I think we should throw an
> > > error.
> > > > > > > > > > > > > > I think doing checking on the client side is a
> good
> > > > idea,
> > > > > > so
> > > > > > > > that
> > > > > > > > > > on
> > > > > > > > > > > > > Yarn /
> > > > > > > > > > > > > > K8s we can discover the problem before submitting
> > the
> > > > > Flink
> > > > > > > > > > cluster,
> > > > > > > > > > > > > which
> > > > > > > > > > > > > > is always a good thing.
> > > > > > > > > > > > > > But we cannot only rely on the client-side checking,
> > > > > > > > > > > > > > because for a standalone cluster, TaskManagers on
> > > > > > > > > > > > > > different machines may have different configurations
> > > > > > > > > > > > > > and the client does not see them.
> > > > > > > > > > > > > > What do you think?
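The client-side check discussed here could look roughly like this (a hedged sketch; the class, method, and pool names are made up for illustration, not Flink's actual classes):

```java
public class MemoryConfigCheck {
    // Fail fast before cluster submission if the explicitly configured
    // fine-grained pools add up to more than the total process memory.
    static void check(long totalProcessMb, long heapMb, long managedMb,
                      long networkMb, long metaspaceMb, long overheadMb) {
        long sum = heapMb + managedMb + networkMb + metaspaceMb + overheadMb;
        if (sum > totalProcessMb) {
            throw new IllegalArgumentException(
                "Sum of configured memory pools (" + sum + "MB) exceeds "
                + "total process memory (" + totalProcessMb + "MB)");
        }
    }

    public static void main(String[] args) {
        check(1000, 500, 200, 100, 96, 100); // sum = 996 <= 1000, accepted
        try {
            check(1000, 600, 200, 100, 96, 100); // sum = 1096 > 1000
        } catch (IllegalArgumentException e) {
            System.out.println("rejected on the client side");
        }
    }
}
```

As noted above, the same validation would still have to run on each TaskManager for standalone clusters, where the client cannot see per-machine configurations.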
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > > > > > [hidden email]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi xintong,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for your detailed proposal. After all
> the
> > > > memory
> > > > > > > > > > > configuration
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > > introduced, it will be more powerful to control
> > the
> > > > > flink
> > > > > > > > > memory
> > > > > > > > > > > > > usage. I
> > > > > > > > > > > > > > > just have few questions about it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We do not differentiate user direct memory and
> > > native
> > > > > > > memory.
> > > > > > > > > > They
> > > > > > > > > > > > are
> > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > included in task off-heap memory. Right? So I don’t
> > > > > > > > > > > > > > > think we could set the -XX:MaxDirectMemorySize
> > > > > > > > > > > > > > > properly. I prefer leaving it a very large value.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If the sum of the fine-grained memory (network
> > > > > > > > > > > > > > > memory, managed memory, etc.) is larger than the
> > > > > > > > > > > > > > > total process memory, how do we
> > deal
> > > > > with
> > > > > > > this
> > > > > > > > > > > > > situation?
> > > > > > > > > > > > > > Do
> > > > > > > > > > > > > > > we need to check the memory configuration in
> > > client?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Xintong Song <[hidden email]> wrote on Wed,
> > > > > > > > > > > > > > > Aug 7, 2019 at 22:14:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We would like to start a discussion thread on
> > > > > "FLIP-49:
> > > > > > > > > Unified
> > > > > > > > > > > > > Memory
> > > > > > > > > > > > > > > > Configuration for TaskExecutors"[1], where we
> > > > > describe
> > > > > > > how
> > > > > > > > to
> > > > > > > > > > > > improve
> > > > > > > > > > > > > > > > TaskExecutor memory configurations. The FLIP
> > > > document
> > > > > > is
> > > > > > > > > mostly
> > > > > > > > > > > > based
> > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > early design "Memory Management and
> > Configuration
> > > > > > > > > Reloaded"[2]
> > > > > > > > > > by
> > > > > > > > > > > > > > > Stephan,
> > > > > > > > > > > > > > > > with updates from follow-up discussions both
> > > online
> > > > > and
> > > > > > > > > > offline.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This FLIP addresses several shortcomings of
> > > current
> > > > > > > (Flink
> > > > > > > > > 1.9)
> > > > > > > > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >    - Different configuration for Streaming
> and
> > > > Batch.
> > > > > > > > > > > > > > > >    - Complex and difficult configuration of
> > > RocksDB
> > > > > in
> > > > > > > > > > Streaming.
> > > > > > > > > > > > > > > >    - Complicated, uncertain and hard to
> > > understand.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Key changes to solve the problems can be
> > > summarized
> > > > > as
> > > > > > > > > follows.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >    - Extend memory manager to also account
> for
> > > > memory
> > > > > > > usage
> > > > > > > > > by
> > > > > > > > > > > > state
> > > > > > > > > > > > > > > >    backends.
> > > > > > > > > > > > > > > >    - Modify how TaskExecutor memory is
> > > > > > > > > > > > > > > >    partitioned, accounting for individual
> > > > > > > > > > > > > > > >    memory reservations and pools.
> > > > > > > > > > > > > > > >    - Simplify memory configuration options and
> > > > > > > > > > > > > > > >    calculation logic.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Please find more details in the FLIP wiki
> > > document
> > > > > [1].
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > (Please note that the early design doc [2] is
> > out
> > > > of
> > > > > > > sync,
> > > > > > > > > and
> > > > > > > > > > it
> > > > > > > > > > > > is
> > > > > > > > > > > > > > > > appreciated to have the discussion in this
> > > mailing
> > > > > list
> > > > > > > > > > thread.)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Looking forward to your feedbacks.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [2]
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

Xintong Song
Thanks for the comments, Till.

I've also seen your comments on the wiki page, but let's keep the
discussion here.

- Regarding 'TaskExecutorSpecifics', what do you think about naming it
'TaskExecutorResourceSpecifics'?
- Regarding passing memory configurations into task executors, I'm in favor
of doing it via environment variables rather than configuration files, for
the following two reasons.
  - It is easier to keep the memory options, once calculated, unchanged
with environment variables than with configuration files.
  - I'm not sure whether we should write the configuration in startup
scripts. Writing changes into the configuration files when running the
startup scripts does not sound right to me. Alternatively, we could make a
copy of the configuration files per Flink cluster, make the task executor
load from the copy, and clean up the copy after the cluster is shut down,
which is complicated. (I think this is also what Stephan means in his
comment on the wiki page?)
- Regarding reserving memory, I think this change should be included in
this FLIP. A big part of the motivation of this FLIP is to unify memory
configuration for streaming / batch and make it easy to configure RocksDB
memory. If we don't support memory reservation, then streaming jobs cannot
use managed memory (neither on-heap nor off-heap), which makes this FLIP
incomplete.
- Regarding network memory, I think you are right. We probably don't need
to change the network stack from using direct memory to using unsafe native
memory. The network memory size is deterministic, cannot be reserved the
way managed memory is, and cannot be overused. I think it also works if we
simply keep using direct memory for the network stack and include it in the
JVM max direct memory size.
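A rough sketch of how the resulting JVM flags could be derived once the pools are decided, with network memory counted into the max direct memory size as described here (all names are hypothetical illustrations, not the actual Flink implementation):

```java
public class JvmArgsSketch {
    // Derive the JVM command-line flags from the configured pool sizes.
    // Network memory stays on direct buffers, so it is added into
    // -XX:MaxDirectMemorySize together with task off-heap memory and
    // JVM overhead.
    static String jvmArgs(long heapMb, long taskOffHeapMb, long overheadMb,
                          long networkMb, long metaspaceMb) {
        long maxDirectMb = taskOffHeapMb + overheadMb + networkMb;
        return "-Xmx" + heapMb + "m"
                + " -Xms" + heapMb + "m"
                + " -XX:MaxDirectMemorySize=" + maxDirectMb + "m"
                + " -XX:MaxMetaspaceSize=" + metaspaceMb + "m";
    }

    public static void main(String[] args) {
        // 500 MB heap, 100 MB task off-heap, 100 MB overhead,
        // 100 MB network, 96 MB metaspace.
        System.out.println(jvmArgs(500, 100, 100, 100, 96));
    }
}
```

Since all three direct-memory users are counted in, an actual direct OOM then points at a genuinely misconfigured budget rather than at an arbitrary cutoff.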

Thank you~

Xintong Song



On Tue, Aug 27, 2019 at 8:12 PM Till Rohrmann <[hidden email]> wrote:

> Hi Xintong,
>
> thanks for addressing the comments and adding a more detailed
> implementation plan. I have a couple of comments concerning the
> implementation plan:
>
> - The name `TaskExecutorSpecifics` is not really descriptive. Choosing a
> different name could help here.
> - I'm not sure whether I would pass the memory configuration to the
> TaskExecutor via environment variables. I think it would be better to write
> it into the configuration one uses to start the TM process.
> - If possible, I would exclude the memory reservation from this FLIP and
> add this as part of a dedicated FLIP.
> - If possible, then I would exclude changes to the network stack from this
> FLIP. Maybe we can simply say that the direct memory needed by the network
> stack is the framework direct memory requirement. Changing how the memory
> is allocated can happen in a second step. This would keep the scope of this
> FLIP smaller.
>
> Cheers,
> Till
>
> On Thu, Aug 22, 2019 at 2:51 PM Xintong Song <[hidden email]>
> wrote:
>
> > Hi everyone,
> >
> > I just updated the FLIP document on wiki [1], with the following changes.
> >
> >    - Removed open question regarding MemorySegment allocation. As
> >    discussed, we exclude this topic from the scope of this FLIP.
> >    - Updated content about JVM direct memory parameter according to
> recent
> >    discussions, and moved the other options to "Rejected Alternatives"
> for
> > the
> >    moment.
> >    - Added implementation steps.
> >
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> >
> > On Mon, Aug 19, 2019 at 7:16 PM Stephan Ewen <[hidden email]> wrote:
> >
> > > @Xintong: Concerning "wait for memory users before task dispose and
> > memory
> > > release": I agree, that's how it should be. Let's try it out.
> > >
> > > @Xintong @Jingsong: Concerning " JVM does not wait for GC when
> allocating
> > > direct memory buffer": There seems to be pretty elaborate logic to free
> > > buffers when allocating new ones. See
> > >
> > >
> >
> http://hg.openjdk.java.net/jdk8u/jdk8u-dev/jdk/file/tip/src/share/classes/java/nio/Bits.java#l643
> > >
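For reference, the control flow of the Bits.java logic linked above can be paraphrased roughly as follows (a simplified illustration, not the actual JDK source; the budget constant is made up):

```java
import java.util.concurrent.atomic.AtomicLong;

public class DirectReservationSketch {
    static final long MAX_DIRECT = 1024; // pretend budget, in bytes
    static final AtomicLong reserved = new AtomicLong();

    // Optimistic reservation against the remaining budget.
    static boolean tryReserve(long size) {
        while (true) {
            long cur = reserved.get();
            if (cur + size > MAX_DIRECT) return false;
            if (reserved.compareAndSet(cur, cur + size)) return true;
        }
    }

    // Only when the budget is exhausted: trigger a GC in the hope that it
    // frees unreachable direct buffers, then retry with growing sleeps
    // before finally giving up with an OutOfMemoryError.
    static void reserve(long size) {
        if (tryReserve(size)) return;
        System.gc();
        for (int sleepMs = 1; sleepMs <= 8; sleepMs <<= 1) {
            if (tryReserve(size)) return;
            try { Thread.sleep(sleepMs); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        throw new OutOfMemoryError("Direct buffer memory");
    }

    public static void main(String[] args) {
        reserve(512);
        reserve(512);
        System.out.println(reserved.get()); // prints 1024
    }
}
```

This is why a reachable but temporarily unreferenced buffer can still cause an OOM, as Jingsong points out below: the GC fallback only helps if the buffers are actually collectible in time.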
> > > @Till: Maybe. If we assume that the JVM default works (like going with
> > > option 2 and not setting "-XX:MaxDirectMemorySize" at all), then I
> think
> > it
> > > should be okay to set "-XX:MaxDirectMemorySize" to
> > > "off_heap_managed_memory + direct_memory" even if we use RocksDB. That
> > is a
> > > big if, though, I honestly have no idea :D Would be good to understand
> > > this, though, because this would affect option (2) and option (1.2).
> > >
> > > On Mon, Aug 19, 2019 at 4:44 PM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for the inputs, Jingsong.
> > > >
> > > > Let me try to summarize your points. Please correct me if I'm wrong.
> > > >
> > > >    - Memory consumers should always avoid returning memory segments
> to
> > > >    memory manager while there are still un-cleaned structures /
> threads
> > > > that
> > > >    may use the memory. Otherwise, it would cause serious problems by
> > > having
> > > >    multiple consumers trying to use the same memory segment.
> > > >    - JVM does not wait for GC when allocating direct memory buffer.
> > > >    Therefore even we set proper max direct memory size limit, we may
> > > still
> > > >    encounter a direct memory OOM if the GC cleans memory slower than
> > > >    direct memory is allocated.
> > > >
> > > > Am I understanding this correctly?
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Mon, Aug 19, 2019 at 4:21 PM JingsongLee <[hidden email]
> > > > .invalid>
> > > > wrote:
> > > >
> > > > > Hi stephan:
> > > > >
> > > > > About option 2:
> > > > >
> > > > > If additional threads are not cleanly shut down before we exit the
> > > > >  task: with the current memory reuse, the exited task has already
> > > > >  freed up the memory it used. If this memory is then used by other
> > > > >  tasks while asynchronous threads of the exited task may still be
> > > > >  writing to it, there will be concurrency safety problems, which can
> > > > >  even lead to errors in user computation results.
> > > > >
> > > > > So I think this is a serious and intolerable bug, No matter what
> the
> > > > >  option is, it should be avoided.
> > > > >
> > > > > About direct memory cleaned by GC:
> > > > > I don't think it is a good idea. I've encountered many situations
> > > > >  where GC ran too late, causing DirectMemory OOM. Releasing and
> > > > >  allocating DirectMemory depends on the type of user job, which is
> > > > >  often beyond our control.
> > > > >
> > > > > Best,
> > > > > Jingsong Lee
> > > > >
> > > > >
> > > > > ------------------------------------------------------------------
> > > > > From:Stephan Ewen <[hidden email]>
> > > > > Send Time: Mon, Aug 19, 2019, 15:56
> > > > > To:dev <[hidden email]>
> > > > > Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> > > > > TaskExecutors
> > > > >
> > > > > My main concern with option 2 (manually release memory) is that
> > > segfaults
> > > > > in the JVM send off all sorts of alarms on user ends. So we need to
> > > > > guarantee that this never happens.
> > > > >
> > > > > The trickiness is in tasks that use data structures / algorithms
> > with
> > > > > additional threads, like hash table spill/read and sorting threads.
> > We
> > > > need
> > > > > to ensure that these cleanly shut down before we can exit the task.
> > > > > I am not sure that we have that guaranteed already, that's why
> option
> > > 1.1
> > > > > seemed simpler to me.
> > > > >
> > > > > On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <
> [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Thanks for the comments, Stephan. Summarized in this way really
> > makes
> > > > > > things easier to understand.
> > > > > >
> > > > > > I'm in favor of option 2, at least for the moment. I think it is
> > not
> > > > that
> > > > > > difficult to keep the memory manager segfault safe, as long as we
> > > > > > always de-allocate a memory segment only when it is released by
> > > > > > all memory consumers. A segfault can only occur if a memory
> > > > > > consumer continues using the buffer of a memory segment after
> > > > > > releasing it, in which case we do want the job to fail so we
> > > > > > detect the memory leak early.
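The fail-fast behavior described in this paragraph can be sketched like this (hypothetical names and an on-heap stand-in buffer, not Flink's actual MemorySegment API):

```java
public class GuardedSegment {
    private byte[] buffer; // stand-in for an off-heap memory region
    GuardedSegment(int size) { buffer = new byte[size]; }

    void put(int index, byte value) {
        checkNotReleased();
        buffer[index] = value;
    }

    byte get(int index) {
        checkNotReleased();
        return buffer[index];
    }

    // De-allocate on release; subsequent accesses fail immediately
    // instead of silently touching freed (or re-used) memory.
    void release() { buffer = null; }

    private void checkNotReleased() {
        if (buffer == null) {
            throw new IllegalStateException("segment already released");
        }
    }

    public static void main(String[] args) {
        GuardedSegment seg = new GuardedSegment(16);
        seg.put(0, (byte) 42);
        System.out.println(seg.get(0)); // prints 42
        seg.release();
        try {
            seg.get(0);
        } catch (IllegalStateException e) {
            System.out.println("use-after-release detected");
        }
    }
}
```

Surfacing the bug as an immediate exception is what makes manual release tolerable: the leak is detected early and the job fails loudly rather than segfaulting.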
> > > > > >
> > > > > > For option 1.2, I don't think this is a good idea. Not only may
> > > > > > the assumption (that regular GC is enough to clean direct
> > > > > > buffers) not always hold, it also makes it harder to find
> > > > > > problems in cases of memory overuse. E.g., suppose the user
> > > > > > configured some direct memory for the user libraries. If a
> > > > > > library actually uses more direct memory than configured, and
> > > > > > the buffers cannot be cleaned by GC because they are still in
> > > > > > use, this may lead to overuse of the total container memory. In
> > > > > > that case, if the usage stays below the JVM's default max direct
> > > > > > memory limit, we never get a direct memory OOM, and it becomes
> > > > > > very hard to understand which part of the configuration needs to
> > > > > > be updated.
> > > > > >
> > > > > > Option 1.1 has a similar problem as 1.2 if the exceeded direct
> > > > > > memory does not reach the max direct memory limit specified by
> > > > > > the dedicated parameter. I think it is slightly better than 1.2,
> > > > > > only because we can tune the parameter.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <[hidden email]>
> > > wrote:
> > > > > >
> > > > > > > About the "-XX:MaxDirectMemorySize" discussion, maybe let me
> > > > summarize
> > > > > > it a
> > > > > > > bit differently:
> > > > > > >
> > > > > > > We have the following two options:
> > > > > > >
> > > > > > > (1) We let MemorySegments be de-allocated by the GC. That
> > > > > > > makes it segfault safe. But then we need a way to trigger GC
> > > > > > > in case de-allocation and re-allocation of a bunch of segments
> > > > > > > happens quickly, which is often the case during batch
> > > > > > > scheduling or task restart.
> > > > > > >   - The "-XX:MaxDirectMemorySize" limit (option 1.1) is one
> > > > > > > way to do this.
> > > > > > >   - Another way could be to have dedicated bookkeeping in the
> > > > > > > MemoryManager (option 1.2), so that this number is independent
> > > > > > > of the "-XX:MaxDirectMemorySize" parameter.
> > > > > > >
> > > > > > > (2) We manually allocate and de-allocate the memory for the
> > > > > > > MemorySegments (option 2). That way we need not worry about
> > > > > > > triggering GC via some threshold or bookkeeping, but it is
> > > > > > > harder to prevent segfaults. We need to be very careful about
> > > > > > > when we release the memory segments (only in the cleanup phase
> > > > > > > of the main thread).
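The manual lifecycle of option 2 can be sketched roughly as follows. This is an illustrative sketch using `sun.misc.Unsafe`, not Flink's actual MemorySegment implementation; the class and method names are made up. The guard in `getLong`/`putLong` shows the kind of bookkeeping that turns a use-after-release into an exception instead of a JVM segfault.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical sketch of option 2: memory is allocated and freed
// explicitly, outside the GC. Without the address check below, a
// use-after-release would read freed native memory and could segfault
// the JVM, which is why release must only happen in the task's cleanup
// phase of the main thread.
final class ManualSegment {
    private static final Unsafe UNSAFE = getUnsafe();
    private long address; // 0 means "released"

    ManualSegment(int size) {
        this.address = UNSAFE.allocateMemory(size); // manual allocation
    }

    void putLong(int offset, long value) {
        checkNotReleased();
        UNSAFE.putLong(address + offset, value);
    }

    long getLong(int offset) {
        checkNotReleased();
        return UNSAFE.getLong(address + offset);
    }

    void release() {
        checkNotReleased();
        UNSAFE.freeMemory(address); // manual de-allocation, no GC involved
        address = 0;
    }

    private void checkNotReleased() {
        if (address == 0) {
            // bookkeeping guard: fail fast instead of segfaulting
            throw new IllegalStateException("segment already released");
        }
    }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A consumer that keeps a reference to the raw address after `release()` would still be unsafe; the sketch only protects accesses that go through the segment object.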
> > > > > > >
> > > > > > > If we go with option 1.1, we probably need to set
> > > > > > > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> > > > > > > direct_memory" and have "direct_memory" as a separate reserved
> > > > > > > memory pool. Because if we just set "-XX:MaxDirectMemorySize"
> > > > > > > to "off_heap_managed_memory + jvm_overhead", then there will
> > > > > > > be times when that entire memory is allocated by direct
> > > > > > > buffers and we have nothing left for the JVM overhead. So we
> > > > > > > either need a way to compensate for that (again, some
> > > > > > > safety-margin cutoff value) or we will exceed the container
> > > > > > > memory.
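The sizing arithmetic described above could be sketched like this. The parameter names are illustrative, not the actual FLIP-49 configuration keys; the point is only that the JVM flag covers both pools so direct buffers can never eat into the JVM-overhead budget.

```java
// Hypothetical sizing helper for option 1.1: the JVM direct-memory limit
// is the sum of the off-heap managed memory and a dedicated direct-memory
// pool, kept separate from the jvm_overhead budget.
final class DirectMemoryLimit {
    static String maxDirectMemoryFlag(long offHeapManagedBytes, long directMemoryBytes) {
        long limitBytes = offHeapManagedBytes + directMemoryBytes;
        return "-XX:MaxDirectMemorySize=" + limitBytes;
    }
}
```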
> > > > > > >
> > > > > > > If we go with option 1.2, we need to be aware that it takes
> > > > > > > elaborate logic to push recycling of direct buffers without
> > > > > > > always triggering a full GC.
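A minimal version of the option 1.2 bookkeeping might look like the sketch below. This is an assumption about how such a MemoryManager-side budget could work, not Flink's actual code; in particular, the `System.gc()` call is only a hint, and the "elaborate logic" mentioned above would be needed to avoid issuing it too often.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative bookkeeping for option 1.2: track direct-memory
// reservations in the MemoryManager itself, independent of the JVM's
// -XX:MaxDirectMemorySize parameter, and nudge the GC only when a
// reservation would exceed the configured budget.
final class DirectMemoryBookkeeper {
    private final long budgetBytes;
    private final AtomicLong reservedBytes = new AtomicLong();

    DirectMemoryBookkeeper(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    /** Returns true if the reservation fits within the budget. */
    boolean reserve(long bytes) {
        if (reservedBytes.addAndGet(bytes) <= budgetBytes) {
            return true;
        }
        // Roll back and hint a GC so unreferenced direct buffers get
        // recycled; real logic must avoid triggering frequent full GCs.
        reservedBytes.addAndGet(-bytes);
        System.gc();
        return false;
    }

    void release(long bytes) {
        reservedBytes.addAndGet(-bytes);
    }
}
```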
> > > > > > >
> > > > > > >
> > > > > > > My first guess is that the options will be easiest to do in
> > > > > > > the following order:
> > > > > > >
> > > > > > >   - Option 1.1 with a dedicated direct_memory parameter, as
> > > > > > > discussed above. We would need to find a way to set the
> > > > > > > direct_memory parameter by default. We could start with 64 MB
> > > > > > > and see how it goes in practice. One danger I see is that
> > > > > > > setting this too low can cause a bunch of additional GCs
> > > > > > > compared to before (we need to watch this carefully).
> > > > > > >
> > > > > > >   - Option 2. It is actually quite simple to implement; we
> > > > > > > could try how segfault safe we are at the moment.
> > > > > > >
> > > > > > >   - Option 1.2: We would not touch the
> > > > > > > "-XX:MaxDirectMemorySize" parameter at all and assume that all
> > > > > > > the direct memory allocations that th