
What is the difference between native code, machine code and assembly code?

I'm confused about machine code and native code in the context of .NET languages.

What is the difference between them? Are they the same?

I have a question about this question. Does it meet Stack Overflow's requirements? As far as I know it does not, but at the same time this kind of question is very helpful and informative. Assuming this type of question is not allowed, where should we ask such questions if not here?

Timwi

The terms are indeed a bit confusing, because they are sometimes used inconsistently.

Machine code: This is the most well-defined one. It is code that uses the byte-code instructions which your processor (the physical piece of metal that does the actual work) understands and executes directly. All other code must be translated or transformed into machine code before your machine can execute it.

Native code: This term is sometimes used in places where machine code (see above) is meant. However, it is also sometimes used to mean unmanaged code (see below).

Unmanaged code and managed code: Unmanaged code refers to code written in a programming language such as C or C++, which is compiled directly into machine code. It contrasts with managed code, which is written in C#, VB.NET, Java, or similar, and executed in a virtual environment (such as .NET or the JavaVM) which kind of “simulates” a processor in software. The main difference is that managed code “manages” the resources (mostly the memory allocation) for you by employing garbage collection and by keeping references to objects opaque. Unmanaged code is the kind of code that requires you to manually allocate and de-allocate memory, sometimes causing memory leaks (when you forget to de-allocate) and sometimes segmentation faults (when you de-allocate too soon). Unmanaged also usually implies there are no run-time checks for common errors such as null-pointer dereferencing or array bounds overflow.

Strictly speaking, most dynamically-typed languages — such as Perl, Python, PHP and Ruby — are also managed code. However, they are not commonly described as such, which shows that managed code is actually somewhat of a marketing term for the really big, serious, commercial programming environments (.NET and Java).

Assembly code: This term generally refers to the kind of source code people write when they really want to write byte-code. An assembler is a program that turns this source code into real byte-code. It is not a compiler because the transformation is 1-to-1. However, the term is ambiguous as to what kind of byte-code is used: it could be managed or unmanaged. If it is unmanaged, the resulting byte-code is machine code. If it is managed, it results in the byte-code used behind-the-scenes by a virtual environment such as .NET. Managed code (e.g. C#, Java) is compiled into this special byte-code language, which in the case of .NET is called Common Intermediate Language (CIL) and in Java is called Java byte-code. There is usually little need for the common programmer to access this code or to write in this language directly, but when people do, they often refer to it as assembly code because they use an assembler to turn it into byte-code.


C++ can compile to machine code, but it is very often compiled to other formats like exe that will run with an operating system.
There are languages that do support garbage collection and opaque references that typically compile to machine code. Most serious implementations of Common Lisp do that. What you say may be true of Microsoft-supported languages, but there are more compiled languages than are supported by Visual Studio.
@CrazyJugglerDrummer: The code contained in EXE files generated by C++ compilers is still machine code. @David Thornley: I mentioned significantly more languages than just those, but I didn’t want to complicate matters by mentioning every obscure oddity.
Many compilers actually compile C/C++ or other languages to assembly language and then call the assembler, which turns it into object files. These are mostly machine code but need a few touches before they can go into memory on the processor; the linker then links all of it into the machine code version of the program. The point is that C/C++ etc. often do not compile straight to machine code: invisibly to the user, the compiler takes two or three steps on the way. TCC, for example, is an exception and does go directly to machine code.
This feels like nitpicking, but not all assemblers translate 1-1 to opcodes. In fact, many modern assemblers support abstraction constructs like classes. Example: TASM, Borland's assembler. en.wikipedia.org/wiki/TASM
Hans Passant

What you see when you use Debug + Windows + Disassembly when debugging a C# program is a good guide for these terms. Here's an annotated version of it when I compile a 'hello world' program written in C# in the Release configuration with JIT optimization enabled:

        static void Main(string[] args) {
            Console.WriteLine("Hello world");
00000000 55                push        ebp                           ; save stack frame pointer
00000001 8B EC             mov         ebp,esp                       ; setup current frame
00000003 E8 30 BE 03 6F    call        6F03BE38                      ; Console.Out property getter
00000008 8B C8             mov         ecx,eax                       ; setup "this"
0000000a 8B 15 88 20 BD 02 mov         edx,dword ptr ds:[02BD2088h]  ; arg = "Hello world"
00000010 8B 01             mov         eax,dword ptr [ecx]           ; TextWriter reference
00000012 FF 90 D8 00 00 00 call        dword ptr [eax+000000D8h]     ; TextWriter.WriteLine()
00000018 5D                pop         ebp                           ; restore stack frame pointer
        }
00000019 C3                ret                                       ; done, return

Right-click the window and tick "Show Code Bytes" to get a similar display.

The column on the left is the machine code address. Its value is faked by the debugger; the code is actually located somewhere else. That could be anywhere, depending on the location selected by the JIT compiler, so the debugger just starts numbering addresses from 0 at the start of the method.

The second column is the machine code: the actual 1s and 0s that the CPU executes. Machine code, like here, is commonly displayed in hex. Illustrative perhaps is that 0x8B selects the MOV instruction; the additional bytes are there to tell the CPU exactly what needs to be moved. Also note the two flavors of the CALL instruction: 0xE8 is the direct call, 0xFF is the indirect call instruction.

The third column is the assembly code. Assembly is a simple language, designed to make it easier to write machine code. It compares to C# being compiled to IL. The compiler used to translate assembly code is called an "assembler". You probably have the Microsoft assembler on your machine; its executable name is ml.exe, or ml64.exe for the 64-bit version. There are two common versions of assembly languages in use. The one you see is the one that Intel and AMD use. In the open source world, assembly in the AT&T notation is common. The language syntax is heavily dependent on the kind of CPU for which it was written; the assembly language for a PowerPC is very different.

Okay, that tackles two of the terms in your question. "Native code" is a fuzzy term, it isn't uncommonly used to describe code in an unmanaged language. Instructive perhaps is to see what kind of machine code is generated by a C compiler. This is the 'hello world' version in C:

int _tmain(int argc, _TCHAR* argv[])
{
00401010 55               push        ebp  
00401011 8B EC            mov         ebp,esp 
    printf("Hello world");
00401013 68 6C 6C 45 00   push        offset ___xt_z+128h (456C6Ch) 
00401018 E8 13 00 00 00   call        printf (401030h) 
0040101D 83 C4 04         add         esp,4 
    return 0;
00401020 33 C0            xor         eax,eax 
}
00401022 5D               pop         ebp  
00401023 C3               ret   

I didn't annotate it, mostly because it is so similar to the machine code generated by the C# program. The printf() function call is quite different from the Console.WriteLine() call but everything else is about the same. Also note that the debugger is now generating the real machine code address and that it is a bit smarter about symbols. A side effect of generating debug info after generating machine code like unmanaged compilers often do. I should also mention that I turned off a few machine code optimization options to make the machine code look similar. C/C++ compilers have a lot more time available to optimize code, the result is often hard to interpret. And very hard to debug.

The key point here is that there are very few differences between machine code generated from a managed language by the JIT compiler and machine code generated by a native code compiler. This is the primary reason why the C# language can be competitive with a native code compiler. The only real difference between them is the support function calls, many of which are implemented in the CLR, and that revolves primarily around the garbage collector.


cHao

Native code and machine code are the same thing -- the actual bytes that the CPU executes.

Assembly code has two meanings. One is the machine code translated into a more human-readable form, with the bytes for the instructions translated into short wordlike mnemonics like "JMP" (which "jumps" to another spot in the code). The other is the IL bytecode that lives in a DLL or EXE: instruction bytes that compilers like C# or VB generate, which will end up translated into machine code eventually but aren't yet.


This answer is ambiguous and serves to pervert the true definitions
Henk Holterman

In .NET, assemblies contain MS Intermediate Language code (MSIL, sometimes CIL).
It is like a 'high level' machine code.

When loaded, MSIL is compiled by the JIT compiler into native code (Intel x86 or x64 machine code).