Intel C++ Compiler

Key features:

  • Vectorization for SSE, SSE2, SSE3, SSE4

The compiler supports the OpenMP 3.0 standard for writing parallel programs. It also includes a variant of OpenMP called Cluster OpenMP, which allows applications written with OpenMP to run on clusters over MPI.
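
As a purely illustrative sketch (not part of the original article), a minimal OpenMP loop in C++ looks as follows; the loop iterations are distributed across threads when OpenMP support is enabled in the compiler:

    // Scale an array in parallel: each thread processes its own chunk of iterations.
    void scale(float* x, int n, float k)
    {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            x[i] *= k;
    }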

The Intel C++ Compiler uses a frontend (the part of the compiler that parses the program being compiled) from the Edison Design Group. The same frontend is used by the SGI MIPSpro, Comeau C++, and Portland Group compilers.

This compiler is widely used for compiling SPEC CPU benchmarks.

There are 4 series of products from Intel containing the compiler:

  • Intel C++ Compiler Professional Edition
  • Intel Cluster Toolkit (Compiler Edition)

The disadvantages of the Linux version of the compiler include partial incompatibility with the GNU extensions of the C language (supported by the GCC compiler), which can cause problems when compiling some programs.

Experimental variants

The following experimental versions of the compiler have been published:

  • Intel STM Compiler Prototype Edition dated September 17, 2007. Software Transactional Memory (STM) support. Released for Linux and Windows, IA-32 (x86 processors) only;
  • Intel Concurrent Collections for C/C++ 0.3, September 2008. Contains mechanisms that facilitate writing parallel C++ programs.

Main flags

Windows       Linux / Mac OS X    Description
/Od           -O0                 Disable optimizations
/O1           -O1                 Optimize to minimize executable file size
/O2           -O2                 Optimize for speed; most optimizations enabled (the default)
/O3           -O3                 Enable all optimizations from O2, plus intensive loop optimizations
/Qip          -ip                 Enable interprocedural optimization within a single file
/Qipo         -ipo                Enable interprocedural optimization across all project files
/QxO          -xO                 Allow the use of SSE3, SSE2 and SSE instructions on processors from any manufacturer
/fast         -fast               "Fast mode": equivalent to "/O3 /Qipo /QxHost /no-prec-div" on Windows and "-O3 -ipo -static -xHOST -no-prec-div" on Linux; the "-xHOST" flag means optimization for the processor on which the compiler is running
/Qprof-gen    -prof_gen           Create an instrumented version of the program that collects an execution profile
/Qprof-use    -prof_use           Use the profile information gathered from runs of the program built with prof_gen
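
As an illustration (the file name is hypothetical; icl and icc are the usual driver names for the Windows and Linux compilers), a typical optimized build combining the flags above might look like:

    icl /O3 /Qipo main.cpp        Windows: aggressive optimization plus cross-file interprocedural optimization
    icc -O3 -ipo main.cpp         Linux: the same settings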


Introduction

In late 2003, Intel introduced version 8.0 of its compiler collection. The new compilers are designed to improve the performance of applications running on servers, desktops and mobile systems (laptops, mobile phones and PDAs) based on Intel processors. We are pleased to note that this product was created with the active participation of employees of the Nizhny Novgorod Intel Software Development Center and Intel specialists from Sarov.

The new series includes Intel compilers for C++ and Fortran for Windows and Linux, as well as Intel C++ compilers for Windows CE .NET. The compilers target systems based on the following Intel processors: Intel Itanium 2, Intel Xeon, Intel Pentium 4, Intel Personal Internet Client Architecture processors for mobile phones and PDAs, and the Intel Pentium M processor for mobile PCs (a component of Intel Centrino mobile technology).

The Intel Visual Fortran Compiler for Windows provides next-generation compilation technologies for high-performance computing. It combines the language functionality of Compaq Visual Fortran (CVF) with the performance improvements made possible by Intel's compilation and code-generation optimization technologies, and it simplifies porting source code developed with CVF to the Intel Visual Fortran environment. This compiler is the first to implement CVF language features both on 32-bit Intel systems and on systems based on the Intel Itanium processor family running Windows. In addition, it makes CVF language features available on Linux systems based on 32-bit Intel processors and the Intel Itanium processor family. For 2004, an extended version of this compiler is planned: the Intel Visual Fortran Compiler Professional Edition for Windows, which will include the IMSL Fortran 5.0 Library developed by Visual Numerics, Inc.


"The new compilers also support Intel's upcoming processors, code-named Prescott, which provide new commands to improve graphics and video performance, as well as other performance enhancements. They also support new technology Mobile MMX(tm), which similarly improves the performance of graphics, sound and video applications for mobile phones and PDAs, - said Alexei Odinokov, co-director of the Intel Software Development Center in Nizhny Novgorod. - These compilers provide application developers with a single set of tools for building new applications for wireless networks based on Intel architecture. The new Intel compilers also support Intel's Hyper-Threading Technology and the OpenMP 2.0 industry specification, which defines the use of directives high level to control the flow of instructions in applications".

Among the new tools included with the compilers are Intel Code Coverage and Intel Test Prioritization. Together, these tools help accelerate application development and improve application quality by improving the software testing process.

When an application is tested, the Code Coverage tool provides complete details about the use of the application's logic and the location of the exercised areas in the application's source code. If changes are made to the application, or if a given test does not cover the part of the application that interests the developer, the Test Prioritization tool makes it possible to check the operation of the selected area of program code.

The new Intel compilers come in a variety of configurations priced from $399 to $1,499. They can be purchased today from Intel Corporation or from resellers around the world, a list of which is available at http://www.intel.com/software/products/reseller.htm#Russia.

Support for Prescott processors

Support for the Intel Pentium 4 (Prescott) processor in the eighth version of the compiler is as follows:

1. Support for SSE3 instructions (also known as PNI, Prescott New Instructions). Three levels of support can be distinguished here:

a. Inline assembly. For example, the compiler will recognize the following use of an SSE3 instruction: _asm { addsubpd xmm0, xmm1 }. Thus, users interested in low-level optimization can access the new assembler instructions directly.
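
A minimal sketch of this form (MSVC-style inline assembly, IA-32 only; the surrounding function and the choice of registers are purely illustrative):

    void addsub_demo()
    {
        __asm {
            addsubpd xmm0, xmm1   // SSE3: subtract in the low element, add in the high element
        }
    }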

b. In the C/C++ compiler, the new instructions are available at a higher level than inline assembly, namely through built-in (intrinsic) functions:

Built-in functions

    Built-in function       Generated instruction
    _mm_addsub_ps           addsubps
    _mm_hadd_ps             haddps
    _mm_hsub_ps             hsubps
    _mm_moveldup_ps         movsldup
    _mm_movehdup_ps         movshdup
    _mm_addsub_pd           addsubpd
    _mm_hadd_pd             haddpd
    _mm_hsub_pd             hsubpd
    _mm_loaddup_pd          movddup xmm, m64
    _mm_movedup_pd          movddup reg, reg
    _mm_lddqu_si128         lddqu

The table shows the built-in functions and the corresponding assembler instructions from the SSE3 set. The same support exists for instructions from the MMX/SSE/SSE2 sets. This allows the programmer to perform low-level code optimization without resorting to assembly-language programming: the compiler itself takes care of mapping the built-in functions to the corresponding processor instructions and of using registers optimally, so the programmer can concentrate on creating an algorithm that makes effective use of the new instruction sets.
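
A minimal illustration of this approach (not from the original text; pmmintrin.h is the usual header for the SSE3 intrinsics):

    #include <pmmintrin.h>

    // Compiles to a single addsubps: even elements are subtracted, odd elements are added.
    __m128 addsub4(__m128 a, __m128 b)
    {
        return _mm_addsub_ps(a, b);
    }

    // Optimized unaligned 128-bit load: compiles to the lddqu instruction.
    __m128i load_u(const void* p)
    {
        return _mm_lddqu_si128((const __m128i*)p);
    }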

c. Automatic generation of the new instructions by the compiler. The previous two methods involve the use of the new instructions by the programmer, but the compiler is also able (with the appropriate options - see section 3 below) to generate instructions from the SSE3 set automatically for C/C++ and Fortran code. For example, the optimized unaligned load instruction (lddqu) gives a performance gain of up to 40% (for instance, in video and audio encoding tasks). Other instructions from the SSE3 set give a significant speedup in 3D graphics tasks or in computational tasks that use complex numbers. For example, the graph in section 3.1 below shows that for the 168.wupwise application from the SPEC CPU2000 FP suite, the speedup obtained from automatic generation of SSE3 instructions was about 25%; the performance of this application depends heavily on the speed of complex-number arithmetic.
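
A sketch of the kind of source code that can benefit from this (purely illustrative): a loop over complex numbers that the compiler may vectorize using SSE3 instructions such as addsubps and movsldup when a Prescott-specific option is used:

    #include <complex>

    // Complex multiply-accumulate over arrays; a candidate for automatic SSE3 vectorization.
    void cmac(std::complex<float>* dst, const std::complex<float>* a,
              const std::complex<float>* b, int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] += a[i] * b[i];
    }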

2. Use of the microarchitectural advantages of the Prescott processor. When generating code, the compiler takes the microarchitectural changes of the new processor into account. For example, some operations (such as integer shifts, integer multiplications, or conversions between different floating-point formats in SSE2) are faster on the new processor than on its predecessors (say, an integer shift now takes one processor cycle versus four on the previous Intel Pentium 4 core). More intensive use of such instructions can speed applications up significantly.
Another example of a microarchitectural change is the improved store-forwarding mechanism (fast loading of data that was recently stored to memory): the stored data does not even have to reach the cache, but is served from an intermediate store buffer, which allows very fast subsequent access to it. This architectural feature makes it possible, for example, to perform more aggressive automatic vectorization of program code.
The compiler also takes into account the increased size of the first- and second-level caches.

3. Improved support for Hyper-Threading technology. This item is closely related to the previous one - microarchitectural changes and their use by the compiler. For example, the runtime library that supports the OpenMP industry specification has been optimized to run on the new processor.

Performance

Using the compilers is an easy and efficient way to take advantage of Intel processor architectures. Below, two ways of using the compilers are (rather loosely) distinguished: a) recompiling programs, possibly with changed compiler settings; b) recompiling with changes both to the compiler settings and to the source code, using compiler diagnostics for further optimization and possibly using other software tools (for example, profilers).


1.1 Optimizing programs by recompiling and changing compiler settings


Often, the first step in migrating to a new optimizing compiler is to use it with the default settings. The next logical step is to use options for more aggressive optimization. Figures 1, 2, 3 and 4 show the effect of switching to Intel compiler version 8.0 compared with other industry-leading products (-O2: default compiler settings; base: settings for maximum performance). The comparison is made on 32-bit and 64-bit Intel architectures, with applications from SPEC CPU2000 used as the test set.


Figure 1




Figure 2




Figure 3




Figure 4


Some of the options supported by the Intel compiler are listed below (here and below the options are given for the Windows OS family; for the Linux OS family there are options with the same effect, although the names may differ; for example, -Od or -QxK for Windows correspond to -O0 or -xK for Linux respectively; more detailed information can be found in the compiler manual).


Optimization level control: the options -Od (no optimizations; used for debugging programs), -O1 (maximum speed while minimizing code size), -O2 (optimization for code execution speed; used by default), and -O3 (the most aggressive optimizations for execution speed; in some cases they can lead to the opposite effect, i.e. a slowdown; note that on IA-64 the use of -O3 leads to a speedup in most cases, while the positive effect on IA-32 is less pronounced). Examples of optimizations enabled by -O3 are loop interchange, loop fusion, loop distribution (the optimization inverse to loop fusion), and software prefetching of data. A slowdown with -O3 is possible when the compiler has chosen an aggressive optimization for a specific case heuristically, without having sufficient information about the program (for example, it generated prefetch instructions for data used in a loop on the assumption that the loop executes many times, when in fact it has only a few iterations). Interprocedural optimization, profile-guided optimization, and the various programmer "hints" (see section 3.2) can help in this situation.

Interprocedural optimization: -Qip (within a single file) and -Qipo (across several or all project files). This includes optimizations such as inline substitution of frequently used code (reducing the cost of function/procedure calls). It also supplies information to other optimization stages - for example, information about the upper bound of a loop (say, when it is a compile-time constant defined in one file but used in many), or information about the alignment of data in memory (many MMX/SSE/SSE2/SSE3 instructions work faster if their operands are aligned in memory on an 8- or 16-byte boundary). The analysis of memory allocation procedures (implemented/called in one of the project files) is propagated to the functions/procedures where this memory is used; this can let the compiler drop the conservative assumption that the data is not properly aligned in memory (and the assumption must be conservative when no additional information is available). Disambiguation, the analysis of data aliasing, is another example: in the absence of additional information, and when it cannot prove that there is no overlap, the compiler proceeds from the conservative assumption that overlaps exist. Such a decision can hurt the quality of optimizations such as automatic vectorization on IA-32 or software pipelining (SWP) on IA-64. Interprocedural optimization can help in the analysis of memory overlaps.

Profile-guided optimization: this includes three stages. 1) Generating instrumented code with the -Qprof_gen option. 2) Running the resulting code on representative data; during the run, information is collected about various characteristics of the execution (for example, branch probabilities or a typical value for the number of loop iterations). 3) Recompiling with the -Qprof_use option, which makes the compiler use the information collected in the previous step. Thus, the compiler can use not only static estimates of important program characteristics but also data obtained during a real run of the program. This can help in the subsequent choice of certain optimizations (for example, a more efficient arrangement in memory of the program's branches, based on information about how often each branch was executed; or applying an optimization to a loop based on its typical number of iterations). Profile-guided optimization is especially useful when a small but representative data set can be chosen (for step 2) that illustrates the most typical future uses of the program well. In some subject areas the choice of such a representative set is quite possible; for example, profile-guided optimization is used by DBMS developers.
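
The three stages map onto a build sequence along these lines (the file and data names are hypothetical):

    icl /Qprof_gen app.cpp        step 1: build an instrumented version
    app.exe train_data.txt        step 2: training run on representative data, collecting the profile
    icl /Qprof_use app.cpp        step 3: rebuild using the collected profile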

The optimizations listed above are of the generic type, i.e. the generated code will work on all processors of the family (say, in the case of the 32-bit architecture, on all of the following processors: Intel Pentium III, Pentium 4 including the Prescott core, and Intel Pentium M). There are also optimizations for a specific processor.

Processor-specific optimizations: -QxK (Pentium III; use of SSE instructions, microarchitecture specifics), -QxW and -QxN (Pentium 4; use of SSE and SSE2 instructions, microarchitecture specifics), -QxB (Pentium M; use of SSE and SSE2 instructions, microarchitecture specifics), and -QxP (Prescott; use of SSE, SSE2 and SSE3 instructions, microarchitecture specifics). Code generated with these options may not work on other members of the processor family (for example, -QxW code may lead to the execution of an invalid instruction on a system based on an Intel Pentium III processor), or may not work with maximum efficiency (for example, -QxB code on a Pentium 4 processor, because of microarchitecture differences). With these options it is also possible to use runtime libraries optimized for a specific processor and its instruction set. To check that the code is actually being executed on the target processor, a dispatch mechanism (cpu dispatch) is implemented: the processor is checked during program execution. Depending on the situation, this mechanism may or may not be activated. Dispatching is always used with the -Qax(K,W,N,P) variants of the option: two versions of the code are generated, one optimized for the specific processor and one "generic", and the choice is made at run time. Thus, at the cost of a larger code size, the program can run on all processors of the line while running optimally on the target processor. Another option is to optimize the code for an earlier member of the line and use that code on it and on subsequent processors. For example, -QxN code can run on a Pentium 4 with either the Northwood or the Prescott core, with no increase in code size. With this approach you get good, though not optimal, performance on a system with a Prescott processor (because SSE3 is not used and microarchitecture differences are not taken into account), together with optimal performance on Northwood. Similar options also exist for IA-64 processors; at the moment there are two of them: -G1 (Itanium) and -G2 (Itanium 2; the default).
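
For example (the file name is hypothetical), the difference between these approaches is visible in the build line:

    icl /QxP  app.cpp     Prescott-specific code: SSE3 is used, but the binary may not run on older processors
    icl /QaxP app.cpp     two code paths, generic IA-32 plus Prescott-specific, selected at run time
    icl /QxN  app.cpp     Northwood-tuned code that also runs on Prescott, but without SSE3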

The graph below (Figure 5) shows the speedup (relative to 1, i.e. no speedup) from using some of the optimizations listed above (namely -O3 -Qipo -Qprof_use -Qx(N,P)) on a Prescott processor, compared with the default settings (-O2). Using -QxP helps in some cases to obtain a speedup over -QxN. The greatest speedup is achieved in the 168.wupwise application already mentioned in the previous section (due to intensive optimization of complex arithmetic using SSE3 instructions).


Figure 5


Figure 6 below shows the ratio (in times) of the speed of code built with the best settings to that of completely unoptimized code (-Od) on Pentium 4 and Itanium 2 processors. It can be seen that Itanium 2 depends much more on the quality of optimization. This is especially pronounced for floating-point (FP) calculations, where the ratio is about 36 times. Floating-point calculations are a strong point of the IA-64 architecture, but care must be taken to use the most efficient compiler settings; the resulting performance gain repays the effort spent on finding them.


Figure 6. Speedup when using the best optimization options, SPEC CPU2000


Intel compilers support the OpenMP industry specification for building multi-threaded applications. Both explicit (the -Qopenmp option) and automatic (-Qparallel) parallelization are supported. In explicit mode the programmer is responsible for the correct and efficient use of the OpenMP standard. With automatic parallelization the compiler carries the additional burden of analyzing the program code, which is why automatic parallelization currently works effectively only on fairly simple code.
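
In terms of options, the two modes look like this (the file name is hypothetical):

    icl /Qopenmp   app.cpp     explicit parallelism: OpenMP directives in the source are honoured
    icl /Qparallel app.cpp     automatic parallelization: the compiler looks for parallelizable loops itself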

The graph in Figure 7 shows the speedup from explicit parallelization on an engineering (pre-production) sample of a system based on an Intel Pentium 4 (Prescott) processor with Hyper-Threading support: 2.8 GHz, 2 GB RAM, 8 KB L1 cache, 512 KB L2 cache. SPEC OMPM2001 is used as the test suite; this set targets small and medium SMP systems, with memory consumption of up to two gigabytes. The applications were compiled with Intel 8.0 C/C++ and Fortran using two sets of options, -Qopenmp -Qipo -O3 -QxN and -Qopenmp -Qipo -O3 -QxP, and with each of them the applications were run with Hyper-Threading enabled and disabled. The speedup values on the graph are normalized to the performance of the single-threaded version with Hyper-Threading disabled.


Figure 7: Applications from the SPEC OMPM2001 suite on a Prescott processor


It can be seen that in 9 out of 11 cases, explicit parallelization with OpenMP gives a performance boost when Hyper-Threading technology is enabled. One application (312.swim) shows a slowdown; it is a known fact that this application depends heavily on memory bandwidth. As with SPEC CPU2000, wupwise benefits greatly from the Prescott optimizations (-QxP).


1.2 Optimizing programs with changes to the source code and using compiler diagnostics


The previous sections considered the influence of the compiler (and its settings) on code execution speed. At the same time, Intel compilers provide more opportunities for code optimization than just changing settings. In particular, the compilers allow the programmer to place "hints" in the program code that enable generation of more efficient code. Below are some examples for the C/C++ language (similar facilities exist for Fortran, differing only in syntax).

#pragma ivdep (where ivdep stands for "ignore vector dependencies") is used before a program loop to tell the compiler that there are no data dependences inside it. This hint works when the compiler conservatively assumes, based on its analysis, that such dependences may exist (if the compiler can prove as a result of its analysis that a dependence does exist, the "hint" has no effect), while the author of the code knows that such dependences cannot arise. With this hint the compiler can generate more efficient code: automatic vectorization for IA-32 (the use of vector instructions from the MMX/SSE/SSE2/SSE3 sets for loops in C/C++ and Fortran programs; you can read more about this technique, for example, in the Intel Technology Journal), or software pipelining (SWP) for IA-64.
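
A minimal sketch of such a situation (the names are illustrative): the offset k is not known at compile time, so the compiler conservatively assumes a possible dependence; the author, knowing that k is never negative, removes that assumption with the pragma:

    // With #pragma ivdep the loop can be auto-vectorized (IA-32) or software-pipelined (IA-64).
    void update(float* a, int k, int n)   // the caller guarantees k >= 0 and that a has n + k elements
    {
    #pragma ivdep
        for (int i = 0; i < n; ++i)
            a[i] = a[i + k] * 2.0f;
    }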

#pragma vector always is used to force the compiler to reverse its decision that vectorizing a loop is inefficient (this applies both to automatic vectorization for IA-32 and to SWP for IA-64), a decision based on an analysis of the quantitative and qualitative characteristics of the work done at each iteration.

#pragma novector does the opposite of #pragma vector always.

#pragma vector aligned is used to tell the compiler that the data used in the loop is aligned on a 16-byte boundary. This makes it possible to generate more efficient and/or more compact code (by omitting runtime checks).

#pragma vector unaligned does the opposite of #pragma vector aligned. A performance gain is hard to expect in this case, but one can count on more compact code.

#pragma distribute point is used inside a program loop so that the compiler can split (distribute) the loop at this point into several smaller ones. Such a "hint" can be used, for example, when the compiler fails to vectorize the original loop automatically (say, because of a data dependence that cannot be ignored even with #pragma ivdep), while each (or some) of the newly formed loops can be vectorized efficiently.

#pragma loop count (N) is used to tell the compiler that the most likely number of iterations of the loop is N. This information helps the compiler choose the most effective optimization for the loop (for example, whether to unroll it, whether to apply SWP or automatic vectorization, whether to use software prefetch instructions, and so on).

The "hint" _assume_aligned(p, base) is used to tell the compiler that the memory region associated with pointer p is aligned on a base = 2^n byte boundary.

This is far from a full list of the various "hints" to the compiler that can significantly affect the efficiency of the generated code. The question arises: how does one determine that the compiler needs a hint?

First, you can use the compiler's diagnostics in the form of the reports it provides to the programmer. For example, the -Qvec_reportN option (where N ranges from 0 to 3 and sets the level of detail) produces an automatic vectorization report. The programmer gets information about which loops were vectorized, and for the rest the compiler reports the reasons why vectorization failed. Suppose the cause was a conservatively assumed data dependence: in this case, if the programmer is sure that the dependence cannot occur, #pragma ivdep can be used. The compiler provides similar capabilities on IA-64 (comparable to -Qvec_reportN on IA-32) for monitoring the presence and effectiveness of SWP. In general, Intel compilers provide ample opportunities for optimization diagnostics.

Second, other software products (such as the Intel VTune profiler) can be used to find performance bottlenecks in the code. The results of the analysis can help the programmer make the necessary changes.

You can also use the assembler code listing generated by the compiler for analysis.


Figure 8


Figure 8 above shows the step-by-step process of optimizing an application with the Intel Fortran compiler (and other Intel software products) for the IA-64 architecture. The example is the 48-hour non-adiabatic regional forecast scheme of the Russian Hydrometeorological Centre (you can read about it, for example, in this article; the article mentions a calculation time of about 25 minutes, but significant changes have occurred since it was written). The performance of the code on a Cray-YMP system is taken as the starting point. The unmodified code with the default compiler options (-O2) showed a 20% performance gain on a 4-way system based on Intel Itanium 2 900 MHz processors. Applying more aggressive optimization (-O3) gave a roughly 2.5x speedup without changing the code, mainly due to SWP and data prefetching. Analysis using compiler diagnostics and the Intel VTune profiler revealed some bottlenecks. For example, the compiler did not software-pipeline several performance-critical loops, reporting that it assumed a data dependence; small changes to the code (the ivdep directive) helped achieve effective pipelining. Using the VTune profiler it was found (and the compiler report confirmed it) that the compiler did not change the order of nested loops (loop interchange) for more efficient use of the cache, again because of conservative assumptions about data dependence. Changes were made to the program's source code. As a result, a 4-fold speedup relative to the initial version was achieved. Explicit parallelization with OpenMP directives, followed by a move to a system with a higher clock frequency, reduced the calculation time to under 8 minutes, which gives more than a 16-fold speedup compared with the initial version.

Intel Visual Fortran

Intel Visual Fortran 8.0 combines the CVF compiler front-end technology (the front-end is the part of the compiler responsible for converting the program from programming-language text into the compiler's internal representation, which is largely independent of both the programming language and the target machine) with the components of the Intel compiler responsible for the optimizations and code generation.


Figure 9




Figure 10


Figures 9 and 10 show performance comparisons of Intel Visual Fortran 8.0 with the previous version, Intel Fortran 7.1, and with other well-known compilers for this language running under the Windows and Linux families of operating systems. The comparison used benchmarks whose source code, conforming to the F77 and F90 standards, is available at http://www.polyhedron.com/. More detailed information on compiler performance comparisons is available on the same site (Win32 Compiler Comparisons -> Fortran (77, 90) Execution Time Benchmarks and Linux Compiler Comparisons -> Fortran (77, 90) Execution Time Benchmarks): more compilers are covered there, and the geometric mean is given together with the individual results for each test.

In the previous issue of the magazine we discussed the products of the Intel VTune Performance Analyzer family - performance analysis tools that are deservedly popular with application developers and make it possible to find the code in an application that consumes too many processor resources. This gives developers the opportunity to identify and eliminate potential bottlenecks associated with such sections of code, thereby speeding up the application development process. Note, however, that application performance also depends largely on how efficient the compilers used to develop them are, and on which hardware features they exploit when generating machine code.

The latest Intel C++ and Intel Fortran compilers for Windows and Linux provide up to 40% performance gains in applications for systems based on Intel Itanium 2, Intel Xeon and Intel Pentium 4 processors over existing compilers from other vendors, by exploiting features of these processors such as Hyper-Threading technology.

The optimization features of this family of compilers include the use of a stack for floating-point operations, interprocedural optimization (IPO), profile-guided optimization (PGO), preloading of data into the cache (data prefetching, which avoids the delays associated with memory access), support for characteristic features of Intel processors (for example, the Intel Streaming SIMD Extensions 2 streaming-data extensions specific to the Intel Pentium 4), automatic parallelization of code execution, creation of applications that run on several different types of processors while being optimized for one of them, branch prediction, and extended support for working with execution threads.

Note that Intel compilers are used by such well-known companies as Alias/Wavefront, Oracle, Fujitsu Siemens, ABAQUS, Silicon Graphics and IBM. According to independent testing by a number of companies, the performance of Intel compilers significantly exceeds that of compilers from other manufacturers (see, for example, http://intel.com/software/products/compilers/techtopics/compiler_gnu_perf.pdf).

Below we will look at some features of the latest versions of Intel compilers for desktop and server operating systems.

Compilers for the Microsoft Windows platform

Intel C++ Compiler 7.1 for Windows

Intel C++ Compiler 7.1, released earlier this year, allows a high degree of code optimization for the Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Xeon processors, as well as for the Intel Pentium M processor used with Intel Centrino technology and intended for mobile devices.

The compiler is fully compatible with the Microsoft Visual C++ 6.0 and Microsoft Visual Studio .NET development tools: it can be integrated into the corresponding development environments.

This compiler supports ANSI and ISO C/C++ standards.

Intel Fortran Compiler 7.1 for Windows

The Intel Fortran Compiler 7.1 for Windows, also released earlier this year, allows you to create optimized code for the Intel Itanium, Intel Itanium 2, Intel Pentium 4, Intel Xeon and Intel Pentium M processors.

This compiler is fully compatible with the Microsoft Visual C++ 6.0 and Microsoft Visual Studio .NET development tools, that is, it can be integrated into the corresponding development environments. In addition, it allows 64-bit applications for operating systems running on Itanium/Itanium 2 processors to be developed with Microsoft Visual Studio on a 32-bit Pentium processor, using the 64-bit Intel Fortran Compiler. When debugging code, this compiler makes it possible to use the debugger for the Microsoft .NET platform.

If you have Compaq Visual Fortran 6.6 installed, you can use the Intel Fortran Compiler 7.1 instead of the original compiler, since these compilers are compatible at the source-code level.

The Intel Fortran Compiler 7.1 for Windows fully complies with the ISO Fortran 95 standard and supports building and debugging mixed-language applications in C and Fortran.

Compilers for the Linux platform

Intel C++ Compiler 7.1 for Linux

Another compiler released at the beginning of the year, Intel C++ Compiler 7.1 for Linux, achieves a high degree of code optimization for the Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Pentium M processors. This compiler is fully compatible with the GNU C compiler at the level of source code and object modules, which allows applications built with GNU C (including on operating systems such as SCO or early versions of Sun Solaris) to be migrated to it at no extra cost, and it is fully compatible with the gcc 3.2 compiler at the binary level. Finally, with the Intel C++ Compiler 7.1 for Linux you can even recompile the Linux kernel after making a few minor changes to its source code.

Intel Fortran Compiler 7.1 for Linux

The Intel Fortran Compiler 7.1 for Linux allows you to create optimized code for the Intel Itanium, Intel Itanium 2, Intel Pentium 4 and Intel Pentium M processors. This compiler is fully compatible with the Compaq Visual Fortran 6.6 compiler at the source-code level, allowing applications created with Compaq Visual Fortran to be recompiled with it, thus improving their performance.

In addition, this compiler is compatible with the utilities developers use, such as the emacs editor, the gdb debugger, and the make build utility.

Like the Windows version, Intel Fortran Compiler 7.1 for Linux fully complies with the ISO Fortran 95 standard and supports building and debugging applications containing code in two languages, C and Fortran.

It should be emphasized that a significant contribution to the creation of the listed Intel compilers was made by specialists of Intel's Russian software development center in Nizhny Novgorod. More information about Intel compilers can be found on the Intel Web site at www.intel.com/software/products/.

The second part of this article will be devoted to Intel compilers that create applications for mobile devices.

Examples of real hacks: Intel C++ 7.0 Compiler — WASM.RU Archive

…the Intel C++ 7.0 compiler finished downloading late at night, around 5:00 in the morning. I was incredibly sleepy, but curiosity - had the protection been strengthened or not - was tearing me apart. Deciding that I would not fall asleep anyway until I had dealt with the protection, I opened a new console, reset the TEMP and TMP system variables to the C:\TEMP directory, and hastily typed the indecently long installer name W_CC_P_7.0.073.exe at the command line (the need to set the TEMP and TMP variables is due to the fact that in Windows 2000 they point by default to a very deeply nested directory, and the Intel C++ installer - and not only it - does not support such long paths).

It immediately became clear that the protection policy had been radically revised: the presence of a license was now checked already at the installation stage (in version 5.x installation went through without any problems). OK, we give the dir command and look at what we are now up against:

    Contents of the folder C:\TMP\IntelC++Compiler70

    17.03.2003  05:10    <DIR>          html
    17.03.2003  05:11    <DIR>          x86
    17.03.2003  05:11    <DIR>          Itanium
    17.03.2003  05:11    <DIR>          notes
    05.06.2002  10:35            45 056 AutoRun.exe
    10.07.2001  12:56                27 autorun.inf
    29.10.2002  11:25             2 831 ccompindex.htm
    24.10.2002  08:12           126 976 ChkLic.dll
    18.10.2002  10:37           552 960 chklic.exe
    17.10.2002  16:29            28 663 CLicense.rtf
    17.10.2002  16:35               386 credist.txt
    16.10.2002  17:02            34 136 Crelnotes.htm
    19.03.2002  14:28             4 635 PLSuite.htm
    21.02.2002  12:39             2 478 register.htm
    02.10.2002  14:51            40 960 Setup.exe
    02.10.2002  10:40               151 Setup.ini
    10.07.2001  12:56               184 setup.mwg

    19 files        2 519 238 bytes
    6 folders     886 571 008 bytes free

Aha! The setup.exe installer takes only forty-something kilobytes. Very good! You can hardly hide serious protection in such a volume, and even if you could, this tiny file costs nothing to analyze in its entirety - down to the last byte of the disassembler listing. However, it is not a fact that the protection code is located exactly in setup.exe; it could live somewhere else, for example in... ChkLic.dll/ChkLic.exe, which together occupy a little less than seven hundred kilobytes. Wait, what is ChkLic? Could that be short for Check License? Hmm, the guys at Intel obviously have serious problems with their sense of humor. It would have been more honest to call this file "Hack Me"! Well, judging by the size, ChkLic is that same FLEXlm, which we have already encountered (see "Intel C++ 5.0 Compiler") and roughly know how to break.

We give the command "dumpbin /EXPORTS ChkLic.dll" to examine the exported functions and... hold on tight to the keyboard so as not to fall off the chair:

    Dump of file ChkLic.dll

    Section contains the following exports for ChkLic.dll

        0 characteristics
        3DB438B4 time date stamp Mon Oct 21 21:26:12 2002
        1 number of functions
        1 number of names

        ordinal hint RVA      name
              1    0 000010A0 _CheckValidLicense

Damn it! The protection exports just one single function, with the wonderful name CheckValidLicense. "Wonderful" because the purpose of the function is clear from its name alone, and there is no need for painstaking analysis of the disassembled code. Well, they have killed all the fun... it would have been better if they had exported it by ordinal only, or at least christened it with some frightening name like DES Decrypt.

...dreams, dreams! Okay, back to our sheep. Let's think logically: if all the protection code is located directly in ChkLic.dll (and, judging by the "bolted-on" nature of the protection, this is the case), then the whole "protection" comes down to calling CheckValidLicense from Setup.exe and checking the result it returns. So, to "crack" it, it is enough simply to patch ChkLic.dll, forcing the CheckValidLicense function to always return... by the way, what should it return? More precisely: which return value corresponds to a successful license check? No, don't rush to disassemble setup.exe to find out - there are not that many possible options: either FALSE or TRUE. Are you betting on TRUE? Well, in a sense that is logical, but on the other hand: why did we decide that CheckValidLicense returns a success flag rather than an error code? After all, it has to somehow motivate its refusal to install the compiler: the license file was not found, the file is damaged, the license has expired, and so on. Okay, let's try returning zero, and if that doesn't work, we'll return one.

OK, buckle up, let's go! We launch HIEW, open the ChkLic.dll file (if it does not open, remember the gophers three times and temporarily copy it to the root or to any other directory whose name contains none of the special characters that hiew dislikes so much). Then, turning again to the export table obtained with dumpbin, we determine the address of the CheckValidLicense function (in this case 010A0h) and jump to its beginning via "10A0". Now we cut it down live, overwriting the old code with "XOR EAX, EAX / RETN 4". Why "RETN 4" and not just "RET"? Because the function follows the stdcall convention, which can be discovered by looking at its epilogue in HIEW (just scroll the disassembler screen down until you meet RET).

Checking... It works!!! Despite the absence of a license, the installer starts the installation without asking unnecessary questions. So the protection has fallen. We can hardly believe everything is that simple, and, so as not to sit staring blankly at the monitor while waiting for the installation to finish, we set our favorite IDA disassembler loose on setup.exe. The first thing that catches the eye is the absence of CheckValidLicense in the list of imported functions. Maybe it somehow launches the ChkLic.exe file? Let's try to find a suitable reference among the automatically recognized strings ("~View -> Names"): aha, the string "ChkLic.exe" is not there at all, but "ChkLic.dll" is found. I see - that means the ChkLic library is loaded by explicit linking via LoadLibrary. Following the cross-reference confirms this:

    .text:0040175D    push    offset aChklic_dll      ; lpLibFileName
    .text:00401762    call    ds:LoadLibraryA
    .text:00401762    ; load ChkLic.dll
    .text:00401768    mov     esi, eax
    .text:0040176A    push    offset a_checkvalidlic  ; lpProcName
    .text:0040176F    push    esi                     ; hModule
    .text:00401770    call    ds:GetProcAddress
    .text:00401770    ; get the address of the CheckValidLicense function
    .text:00401776    cmp     esi, ebx
    .text:00401778    jz      loc_40192E
    .text:00401778    ; if there is no such library, then exit the installer
    .text:0040177E    cmp     eax, ebx
    .text:00401780    jz      loc_40192E
    .text:00401780    ; if there is no such function in the library, then exit the installer
    .text:00401786    push    ebx
    .text:00401787    call    eax
    .text:00401787    ; call the CheckValidLicense function
    .text:00401789    test    eax, eax
    .text:0040178B    jnz     loc_4019A3
    .text:0040178B    ; if the function returned non-zero, then exit the installer

Incredibly, this terribly primitive protection is built exactly like that! Moreover, the half-megabyte ChkLic.exe file is not needed at all! And why was it worth dragging it off the Internet? By the way, if you decide to keep the compiler distribution (note: I did not say "distribute it"!), then to save disk space you can delete ChkLic.exe and replace ChkLic.dll with a stub exporting a stdcall function CheckValidLicense of the form: int CheckValidLicense(int some_flag) { return 0; }

So, while we were discussing all this, the installer finished installing the compiler and successfully completed its work. Will the compiler start, or is all the most interesting stuff only beginning? We feverishly descend the branched hierarchy of nested folders, find icl.exe (which, as expected, lives in the bin directory), launch it and... the compiler naturally does not start, reporting "icl: error: could not checkout FLEXlm license", without which it cannot continue its work.

So Intel has applied multi-level protection, and the first level turned out to be merely crude foolproofing. Well! We accept the challenge and, drawing on previous experience, automatically look for the LMGR*.DLL file in the compiler directory. Useless! This time there is no such file, but icl.exe has put on a lot of weight, passing the six-hundred-kilobyte mark... Stop! Could the compiler's developers have linked that very FLEXlm statically? Let's check: in Intel C++ 5.0 the sizes of lmgr327.dll and icl.exe added up to 598 KB, and now icl.exe alone takes 684 KB. Allowing for natural senile "obesity", the numbers agree very well. So it is FLEXlm after all! Oh-oh! Without the symbolic names of the functions it will now be much harder to break the protection... But let's not panic ahead of time! Let's just think calmly! It is unlikely that the development team completely rewrote all the code that interacts with this protection "envelope". Most likely, its "improvement" ended with a mere change of linkage type. And if so, the chances of cracking the program are still good!

Remembering that last time the protection code sat in the main function, we determine its address, simply set a breakpoint on it and, once the debugger pops up, bluntly trace the code, glancing alternately at the debugger and at the program's output window: has the abusive message appeared yet? At the same time we note every conditional jump we come across on a separate piece of paper (or file them away in memory, if you prefer), not forgetting to mark whether each conditional jump was taken or not... Stop! While we were chatting, the abusive message has already popped up! OK then! Let's see which conditional jump corresponds to it. Our notes show that the last branch encountered was the conditional jump JNZ located at address 0401075h and "reacting" to the result returned by sub_404C0E:

    .text:0040107F loc_40107F:                        ; CODE XREF: _main+75^j
    .text:0040107F    mov     eax, offset aFfrps      ; "FFrps"
    .text:00401084    mov     edx, 21h
    .text:00401089    call    sub_404C0E
    .text:0040108E    test    eax, eax
    .text:00401090    jnz     short loc_40109A

Obviously, sub_404C0E is the very protection procedure that checks for the presence of a license. How do we trick it? Well, there are plenty of options... First, you can thoughtfully and scrupulously analyze the contents of sub_404C0E to find out what exactly it checks and how. Second, you can simply replace JNZ short loc_40109A with JZ short loc_40109A, or even with NOP, NOP. Third, the instruction that checks the return value, TEST EAX, EAX, can be turned into an instruction that zeroes it: XOR EAX, EAX. Fourth, sub_404C0E itself can be patched so that it always returns zero. I don't know about you, but I liked method number three the most. We change two bytes and start the compiler. If the protection contains no other checks of "licensedness", the program will work, and vice versa (as we remember, there were two such checks in the fifth version). It's amazing, but the compiler no longer complains and works!!! Indeed, as expected, its developers did not strengthen the protection at all but, on the contrary, even weakened it!

Chris Kaspersky


