Name: Abhishek Dubey
Project Name: Optimizing the Pre-dump Algorithm
Organization: CRIU
Mentors:
Commit list (working repo): GitHub link
Official Documentation: Wiki page of product
git clone git@github.com:dubeyabhishek/criu.git
cd criu/
make -j8
# run the complete zdtm test suite
sudo test/zdtm.py run -a --pre=<pre_dump_count> --ignore-taint --keep-going --pre-dump-mode=read
# run a specific zdtm test
sudo python test/zdtm.py run --pre=<pre_dump_count> -t zdtm/<test_case_dir>/<test_case_name> --pre-dump-mode=read
I was involved in GSoC’19 with the CRIU organization, working on the project "Optimizing the Pre-dump Algorithm". The idea behind this project was to reduce the memory pressure and the frozen time of the target process while pre-dumping.
In the current implementation of pre-dump in CRIU, the memory contents (pages) of the target process are first stored in pipes. Later, the pipes are flushed to
disk images or to the page server. Pipes are backed by pipe-buffer pages to store the data, and these pipe-buffer pages are pinned to memory, making them non-swappable.
When the count of pipe-buffer pages is high, the system is left with few swappable pages to serve new memory requests. This puts memory pressure
on the overall system and may cause thrashing, hampering performance. So handling memory pressure is the first issue.
Another issue is the duration for which the current pre-dump algorithm keeps the target process frozen.
The parasite (a blob injected into the target process) in the
current implementation drains the memory pages of the target process into pipes, and the target process remains frozen until all pages are drained. The
longer the frozen time, the longer the pipe pages stay pinned to memory, leading to the memory-pressure situation described above. Another use case that
demands the shortest possible frozen time is live migration. So the second objective is to reduce the frozen time of the target process while pre-dumping.
The optimized implementation must solve the two issues mentioned above. In the new pre-dump approach, the target process is frozen only until the memory mappings are
collected. Then the process is unfrozen and continues running normally, and the draining of process memory starts. We use the
process_vm_readv
syscall to drain pages into a user-space buffer. The syscall takes the memory mappings (collected earlier) as input to perform this task. Since the draining of
memory pages and the process execution happen simultaneously, the running process might modify some memory mappings after
they have been collected by pre-dump.
In such a case, process_vm_readv encounters the stale mappings and fails. This creates a race between pre-dumping and process execution, which
must be handled on the fly for process_vm_readv to successfully drain the complete memory.
Patch # | Title | Status |
---|---|---|
[PATCH 0/7] | GSoC 19: Optimizing the Pre-dump Algorithm | Submitted |
[PATCH 1/7] | Adding --pre-dump-mode option | Submitted |
[PATCH 2/7] | Skip generating iov for non-PROT_READ memory | Submitted |
[PATCH 3/7] | Skip adding PROT_READ to non-PROT_READ mappings | Submitted |
[PATCH 4/7] | Adding cnt_sub for stats manipulation | Submitted |
[PATCH 5/7] | Handle vmsplice failure for read mode pre-dump | Submitted |
[PATCH 6/7] | The read mode pre-dump implementation | Submitted |
[PATCH 7/7] | Refactor time accounting macros | Submitted |
[PATCH 8/7] | Added --pre-dump-mode to libcriu | Submitted |
Handling the processing of iovs corresponding to modified memory mappings is the most interesting issue of this project. A detailed discussion can be found here.
test-config: 1 GB
# python test/zdtm.py run --pre <count> -t zdtm/static/maps04 --pre-dump-mode=<read/splice>
Pre-dump # | splice (original) | read (optimized) |
---|---|---|
1 | 0.59* | 0.66* |
2 | 0.06 | 0.06 |
3 | 0.07 | 0.06 |
4 | 0.06 | 0.06 |
5 | 0.06 | 0.06 |
Total | 0.84 | 0.90 |
Average performance drop: ~7%
* The first pre-dump in the sequence has a ~13% performance drop.
test-config: 1 GB
Pre-dump # | splice (original) | read (optimized) |
---|---|---|
1 | 125.13* | 82.93* |
2 | 76.28 | 65.94 |
3 | 78.05 | 69.52 |
4 | 77.80 | 69.58 |
5 | 74.61 | 65.31 |
Total | 431.87 | 353.28 |
Average frozen-time reduction: ~18%
* The first pre-dump in the sequence is ~35% faster.
The new implementation also reduces memory pressure on the system, since pipes no longer hog memory for long durations.
I am thankful to my mentors Pavel, Andrei, and Mike for their guidance throughout the project. It was a great learning experience working with them. I got the chance to dive deep into the memory-draining part of CRIU, and it was fun. Special thanks to Radostin for his prompt help and feedback. I would love to keep contributing to this wonderful community called CRIU.
Overall, this year’s GSoC was an end-to-end learning experience for me, and now I know "open source" better than ever.