Virtual Memory Area
Virtual Memory Area is also called Memory Region in some books.
In a process address space there are many memory areas: contiguous addresses are divided into different memory areas when their access rights differ. For example, in one Java process there were 359 memory areas.
So the kernel needs an efficient way to insert into, remove from, and search the list of memory areas. The semantics of the find_area API are as follows.
Return NULL if:
1. the list itself is empty;
2. the list is not empty and the address is greater than the end of the last memory area.
Return the found area if:
1. the address falls within the range of one area;
2. the address does not fall within any area but is not beyond the last area, meaning
it lies in a hole between areas; the area on the right side of the hole is returned.
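Here is a minimal sketch of these semantics, assuming a singly linked list of areas sorted by address. The struct and names are placeholders of mine, not the kernel's (the kernel's real counterpart, find_vma(), searches a red-black tree in 2.6 rather than walking a plain list):

    #include <stddef.h>

    struct mem_area {
        unsigned long start;            /* first address of the area */
        unsigned long end;              /* first address after the area */
        struct mem_area *next;          /* next area, sorted by start */
    };

    /* Return the first area whose end lies above addr, or NULL. */
    struct mem_area *find_area(struct mem_area *head, unsigned long addr)
    {
        struct mem_area *a;

        for (a = head; a != NULL; a = a->next) {
            /* Covers both cases: addr inside this area, or addr in the
             * hole just before it (the area right of the hole). */
            if (addr < a->end)
                return a;
        }
        /* Empty list, or addr beyond the end of the last area. */
        return NULL;
    }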
The kernel tries to use as few resources as possible. Here is an example: originally, in kernel 2.4, the size of the kernel stack was 8K. In kernel 2.6 it can be 4K if you enable that option at compile time.
Why would the kernel spend effort on such a feature when most PCs have more than 1 gigabyte of memory? I think it has something to do with the C10K problem; C10K means ten thousand concurrent processes (threads). On a system with more than ten thousand processes, such as a web server, saving 4K in every kernel stack amounts to 4K * 10K = 40M of memory saved in total, which is a big deal!
How is it possible to achieve that? Originally the kernel-mode stack was also used for exception and interrupt handling, but exception and interrupt handling is not specific to any process. So in 2.6, interrupts and exceptions get their own per-CPU stacks, and the kernel stack is used only by a process running in kernel mode; the stack actually available to the process therefore did not become smaller.
2.4: one 8K stack shared by process kernel mode, exceptions, and interrupts
vs.
2.6: a 4K stack dedicated to process kernel mode
     a 4K per-CPU stack dedicated to exceptions
     a 4K per-CPU stack dedicated to interrupts
Besides this, in the 8K stack of 2.4 the task_struct sits at the bottom of the stack and costs about 1K; in the 4K stack of 2.6 only the thread_info sits at the bottom of the stack, and thread_info is only about 50 bytes (the task_struct itself is allocated separately and reached through a pointer in thread_info).
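Because thread_info sits at the bottom of a stack whose size is a power of two, the kernel can locate it by simply masking the stack pointer. A sketch modeled on the x86 current_thread_info() of that era, where THREAD_SIZE is the stack size (4K or 8K):

    static inline struct thread_info *current_thread_info(void)
    {
        unsigned long sp;

        /* read the current kernel stack pointer */
        asm("movl %%esp, %0" : "=r" (sp));
        /* round down to the stack base, where thread_info lives */
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
    }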
This is just a high-level summary of my understanding of Linux kernel memory management. I think it can help you reach a better understanding of the book Understanding the Linux Kernel.
It is said that memory management is the most complex subsystem in the Linux kernel, yet it does not have many system calls, because most of the complex mechanisms, such as COW (Copy On Write) and demand paging, happen transparently to the user process. For a user process to successfully refer to a linear memory address, the following factors are necessary:
1. The vm_area_struct (Virtual Memory Area, Memory Region) is set up correctly.
2. Physical memory is allocated.
3. The Page Global Directory, page tables, and the corresponding entries are correctly set up according to the virtual memory area and the physical memory.
These three factors can be further simplified as:
Virtual memory
Physical memory
Mapping between virtual memory and physical memory
From the user process's perspective, only virtual memory is visible: when a user process asks for memory, it gets virtual memory, and physical memory may not be allocated yet. All three factors are managed by the kernel and can be thought of as three resources the kernel manages. The kernel not only needs to manage the virtual memory of the user address space, but also the virtual memory of the kernel address space.
When a user process tries to use its virtual memory but the physical memory is not allocated yet, a page fault exception happens; the kernel takes charge of it, allocates the physical memory, and sets up the mapping. The user process then re-executes the instruction and everything goes forward smoothly. This is called demand paging.
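A small user-space experiment (my own illustration, not from the book) makes demand paging visible: map a large anonymous region with mmap and watch in top how RSS stays small until the pages are actually touched.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 512 * 1024 * 1024;   /* 512M of virtual memory */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        puts("mapped; check RSS in top, then press Enter");
        getchar();                        /* RSS is still tiny here */
        memset(p, 1, len);                /* touch every page: faults them in */
        puts("touched; RSS grew by ~512M, press Enter to exit");
        getchar();
        munmap(p, len);
        return 0;
    }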
Besides these, there are many more concepts, such as memory mapping and non-linear memory mapping. I will extend this article as I dig into the details.
ps -H -A
can show the relationships between all processes in a tree format. It is helpful when you want to research the internals of UNIX.
init
keventd
ksoftirqd/0
bdflush
kswapd
We can see from the above that all processes are children of init (directly or indirectly); notably, the kernel threads are also children of the init process.
Process 0 is special; it is not displayed.
From the following:
sshd
sshd
sshd
bash
vim
cscope
sshd
sshd
bash
ps
we can see how ssh works; I had actually created two ssh sessions to the server.
According to the following explanation in Xusage.txt:
-Xms<size> set initial Java heap size
-Xmx<size> set maximum Java heap size
java -Xms512M should allocate at least 512M of memory for Java, but when I watch the process in Linux with top, its RSS and SIZE values are far smaller than 512M. My understanding is that when Java asks the operating system for memory it uses the mmap2 or old_mmap system call, and neither of these actually allocates physical memory; they only allocate virtual memory. So the pre-allocated memory is not committed until it is actually used.
There is not too much grammar to regular expressions; here is just an incomplete summary for future reference.
meta-character
. any character
| or
() grouping
[] character class
[^] negative character class
Greedy Quantifier
? optional
* any amount
+ at least one
lazy quantifiers
?? optional (lazy)
*? any amount (lazy)
+? at least one (lazy)
possessive quantifiers
?+ optional (possessive)
*+ any amount (possessive)
++ at least one (possessive)
position related
^ start of the line
\A start of the string
$ end of the line
\Z end of the string
\< start of a word
\> end of a word
\b word boundary (start or end of a word)
non-capturing group (?:Expression)
non-capturing atomic group (?>Expression)
positive lookahead (?=Expression)
negative lookahead (?!Expression)
positive lookbehind (?<=Expression)
negative lookbehind (?<!Expression)
\Q start quoting
\E end quoting
mode modifier
(?modifier)Expression(?-modifier)
valid modifiers
i case insensitive match mode
x free spacing
s dot matches all match mode
m enhanced line-anchor match mode
(?modifier:Expression)
comments:
(?#Comments)
Kernel memory mapping summary
Today I finally became clear about the relationship between:
fixed mapping
permanent kernel mapping
temporary kernel mapping
noncontiguous memory area mapping
(I feel that most of these names are not well chosen; in some passages they may mislead the reader.)
The 4G linear virtual address space is divided into two major parts:
kernel space mapping [3G, 4G)
user space mapping [0, 3G)
The kernel space mapping is divided into further pieces:
linear mapping [3G, 3G + 896M)
non linear mapping [3G + 896M + 8M, 4G)
1. Fixed mapping (a misleading name; it should be "compile-time mapping", since the virtual address is decided at compile time)
2. Temporary mapping
3. Permanent mapping
4. noncontiguous memory area mapping (Vmalloc area)
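To make the last three mappings concrete, here is a sketch of how kernel code of this era (2.6.x) uses each of them; error handling is omitted, and the comments tie each call back to the areas in the diagram below:

    #include <linux/gfp.h>
    #include <linux/highmem.h>
    #include <linux/vmalloc.h>

    static void mapping_examples(void)
    {
        struct page *page = alloc_page(GFP_HIGHUSER);
        void *p, *q, *v;

        /* permanent kernel mapping: may sleep; maps a highmem page
         * into the 4M persistent kmap area below PKMAP_BASE */
        p = kmap(page);
        /* ... use p ... */
        kunmap(page);

        /* temporary kernel mapping: never sleeps; uses one of the
         * per-CPU fixmap slots (FIX_KMAP_BEGIN..FIX_KMAP_END in the
         * enum excerpted below) */
        q = kmap_atomic(page, KM_USER0);
        /* ... use q ... */
        kunmap_atomic(q, KM_USER0);

        /* noncontiguous memory area mapping: virtually contiguous but
         * physically scattered; lives in [VMALLOC_START, VMALLOC_END) */
        v = vmalloc(64 * 1024);
        /* ... use v ... */
        vfree(v);

        __free_page(page);
    }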
The following diagram is for reference.
FIXADDR_TOP (=0xfffff000)
fixed_addresses (temporary kernel mapping is part of it)
#define __FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT)
FIXADDR_START (FIXADDR_TOP - __FIXADDR_SIZE)
temp fixed addresses (used at boot time)
#define __FIXADDR_BOOT_SIZE (__end_of_fixed_addresses << PAGE_SHIFT)
FIXADDR_BOOT_START (FIXADDR_TOP - __FIXADDR_BOOT_SIZE)
Persistent kmap area (4M)
PKMAP_BASE ( (FIXADDR_BOOT_START - PAGE_SIZE*(LAST_PKMAP + 1)) & PMD_MASK )
2*PAGE_SIZE
VMALLOC_END (PKMAP_BASE-2*PAGE_SIZE) or (FIXADDR_START-2*PAGE_SIZE)
noncontiguous memory area mapping (Vmalloc area)
VMALLOC_START (((unsigned long) high_memory + 2*VMALLOC_OFFSET-1) & ~(VMALLOC_OFFSET-1))
high_memory = MIN(896M, physical memory size)
Below is an excerpt of the relevant source code.
#ifdef CONFIG_X86_PAE
#define LAST_PKMAP 512
#else
#define LAST_PKMAP 1024
#endif
#define VMALLOC_OFFSET (8*1024*1024)
#define VMALLOC_START (((unsigned long) high_memory + \
2*VMALLOC_OFFSET-1) & ~(VMALLOC_OFFSET-1))
#ifdef CONFIG_HIGHMEM
# define VMALLOC_END (PKMAP_BASE-2*PAGE_SIZE)
#else
# define VMALLOC_END (FIXADDR_START-2*PAGE_SIZE)
#endif
enum fixed_addresses {
FIX_HOLE,
FIX_VDSO,
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
#ifdef CONFIG_X86_LOCAL_APIC
FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */
#endif
#ifdef CONFIG_X86_IO_APIC
FIX_IO_APIC_BASE_0,
FIX_IO_APIC_BASE_END = FIX_IO_APIC_BASE_0 + MAX_IO_APICS-1,
#endif
#ifdef CONFIG_X86_VISWS_APIC
FIX_CO_CPU, /* Cobalt timer */
FIX_CO_APIC, /* Cobalt APIC Redirection Table */
FIX_LI_PCIA, /* Lithium PCI Bridge A */
FIX_LI_PCIB, /* Lithium PCI Bridge B */
#endif
#ifdef CONFIG_X86_F00F_BUG
FIX_F00F_IDT, /* Virtual mapping for IDT */
#endif
#ifdef CONFIG_X86_CYCLONE_TIMER
FIX_CYCLONE_TIMER, /*cyclone timer register*/
#endif
#ifdef CONFIG_HIGHMEM
FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */
FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
#endif
#ifdef CONFIG_ACPI
FIX_ACPI_BEGIN,
FIX_ACPI_END = FIX_ACPI_BEGIN + FIX_ACPI_PAGES - 1,
#endif
#ifdef CONFIG_PCI_MMCONFIG
FIX_PCIE_MCFG,
#endif
#ifdef CONFIG_PARAVIRT
FIX_PARAVIRT_BOOTMAP,
#endif
__end_of_permanent_fixed_addresses,
/* temporary boot-time mappings, used before ioremap() is functional */
#define NR_FIX_BTMAPS 16
FIX_BTMAP_END = __end_of_permanent_fixed_addresses,
FIX_BTMAP_BEGIN = FIX_BTMAP_END + NR_FIX_BTMAPS - 1,
FIX_WP_TEST,
__end_of_fixed_addresses
};
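These enum values are only indexes. Each index is converted into its virtual address by a compile-time macro counting down from FIXADDR_TOP; this is the kernel's __fix_to_virt(), and it is why I said "fixed mapping" really means compile-time mapping:

    /* index -> virtual address; once x is a constant, the whole
     * expression folds to a constant at compile time */
    #define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT))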
scale up - scale vertically
scale out - scale horizontally
Ways to scale out:
1. Use share-nothing clustering architectures
Session failover cannot completely avoid errors when failures happen, as my earlier article mentioned, yet it damages performance and scalability.
2. Use scalable session replication mechanisms
The most scalable mechanism is paired-node replication; the least scalable is using a database as the session persistence storage.
3. Use collocated deployment instead of a distributed one.
4. Shared resources and services
Database servers, JNDI trees, LDAP Servers, and external file systems can be shared by the nodes in the cluster.
5. Memcached
Memcached's magic lies in its two-stage hash approach. It behaves as though it were a giant hash table, looking up key = value pairs. Give it a key, and set or get some arbitrary data. When doing a memcached lookup, first the client hashes the key against the whole list of servers. Once it has chosen a server, the client then sends its request, and the server does an internal hash key lookup for the actual item data. (A sketch of this two-stage idea follows this list.)
6. Terracotta
Terracotta extends the Java Memory Model of a single JVM to include a cluster of virtual machines such that threads on one virtual machine can interact with threads on another virtual machine as if they were all on the same virtual machine with an unlimited amount of heap.
7. Use unorthodox approaches to achieve high scalability
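Below is a minimal sketch of Memcached's two-stage hash from item 5. The server list and the djb2 hash are placeholders of mine; real memcached clients use their own hashing schemes, and stage 2 happens inside the server:

    #include <stdio.h>

    static const char *servers[] = { "cache1:11211", "cache2:11211", "cache3:11211" };
    #define NSERVERS 3   /* hypothetical 3-node cluster */

    /* djb2 string hash, for illustration only */
    static unsigned long hash(const char *key)
    {
        unsigned long h = 5381;
        while (*key)
            h = h * 33 + (unsigned char)*key++;
        return h;
    }

    int main(void)
    {
        const char *key = "user:42";
        /* stage 1: the client hashes the key against the server list */
        const char *server = servers[hash(key) % NSERVERS];
        printf("send get/set for %s to %s\n", key, server);
        /* stage 2: that server hashes the same key again into its own
         * in-memory hash table to locate the item's data */
        return 0;
    }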
Today I ran into a strange Hibernate problem. (The Hibernate version I use is 2.1, which is quite old; I do not know whether the problem still exists in Hibernate 3.)
Below is the exception stack trace I captured.
java.lang.ClassCastException: java.lang.Boolean
at net.sf.hibernate.type.StringType.set(StringType.java:26)
at net.sf.hibernate.type.NullableType.nullSafeSet(NullableType.java:48)
at net.sf.hibernate.type.NullableType.nullSafeSet(NullableType.java:35)
at net.sf.hibernate.persister.EntityPersister.dehydrate(EntityPersister.java:393)
at net.sf.hibernate.persister.EntityPersister.insert(EntityPersister.java:466)
at net.sf.hibernate.persister.EntityPersister.insert(EntityPersister.java:442)
at net.sf.hibernate.impl.ScheduledInsertion.execute(ScheduledInsertion.java:29)
at net.sf.hibernate.impl.SessionImpl.executeAll(SessionImpl.java:2382)
at net.sf.hibernate.impl.SessionImpl.execute(SessionImpl.java:2335)
at net.sf.hibernate.impl.SessionImpl.flush(SessionImpl.java:2204)
The strange thing is that the program ran fine on my local Tomcat, but broke as soon as it was deployed to the Linux server.
After careful analysis, I found that the object being stored defines both a get method and an is method for the same property. Here is an example:
public class FakePO {
    String goodMan;

    public String getGoodMan() {
        return goodMan;
    }

    public void setGoodMan(String goodMan) {
        this.goodMan = goodMan;
    }

    public boolean isGoodMan() {
        return "Y".equalsIgnoreCase(goodMan);
    }
}
I suspected that this derived helper method, isGoodMan(), was causing the problem. By tracing the Hibernate 2 source code, I found that Hibernate 2 accesses the PO through the reflection API as follows:
private static Method getterMethod(Class theClass, String propertyName) {
    Method[] methods = theClass.getDeclaredMethods();
    for (int i = 0; i < methods.length; i++) {
        // only carry on if the method has no parameters
        if ( methods[i].getParameterTypes().length == 0 ) {
            String methodName = methods[i].getName();
            // try "get"
            if ( methodName.startsWith("get") ) {
                String testStdMethod = Introspector.decapitalize( methodName.substring(3) );
                String testOldMethod = methodName.substring(3);
                if ( testStdMethod.equals(propertyName) || testOldMethod.equals(propertyName) ) return methods[i];
            }
            // if not "get" then try "is"
            /*boolean isBoolean = methods[i].getReturnType().equals(Boolean.class) ||
                methods[i].getReturnType().equals(boolean.class);*/
            if ( methodName.startsWith("is") ) {
                String testStdMethod = Introspector.decapitalize( methodName.substring(2) );
                String testOldMethod = methodName.substring(2);
                if ( testStdMethod.equals(propertyName) || testOldMethod.equals(propertyName) ) return methods[i];
            }
        }
    }
    return null;
}
Reading the code above carefully, you can see that Hibernate simply iterates over the class's declared methods and checks whether the name matches the property; it never checks whether the method's return type matches the property's type. So in our example it may return either the get method or the is method, depending on the order of the method list, and that order is not guaranteed at all (getDeclaredMethods() returns methods in no particular order). This also explains why the problem only occurs on particular platforms.
Recently I have been reading the implementation of the write system call. Although some details are still unclear to me, I now have a rough understanding of how it works. A summary follows.
Assume the most common case here and leave Direct IO aside. Viewed from a high level, writing content to a file takes the following steps:
1. sys_write copies the content the user process wants to write into the kernel's page cache for that file; sys_write itself ends here.
2. The pdflush kernel threads (triggered periodically or by kernel thresholds) flush the dirty page-cache pages; in fact they merely submit IO requests to the underlying driver.
3. The IO requests are not executed synchronously; the underlying driver schedules them and issues the DMA commands.
4. When the physical IO completes, an interrupt notifies the kernel, and the kernel updates the status of the IO.
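One consequence of these steps: write() returns as soon as step 1 finishes, long before the data reaches the disk. A tiny demonstration (my own illustration, not from the kernel source) that forces the rest of the path synchronously with fsync():

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("test.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        write(fd, "hello\n", 6);  /* step 1 only: data lands in the page cache */
        fsync(fd);                /* forces steps 2-4 and waits for the disk */
        close(fd);
        return 0;
    }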
Now I have to go put my son to bed. When I have time I will continue to flesh out each part of the implementation.
The call path of sys_write (my Linux kernel version is 2.6.24, the file system is ext3):
asmlinkage ssize_t sys_write(unsigned int fd, const char __user * buf, size_t count)
vfs_write(file, buf, count, &pos);
file->f_op->write(file, buf, count, pos);
Here file->f_op is a table of function pointers initialized when the file is opened; for the ext3 file system the write function is do_sync_write.
Below are the key points of its implementation.
for (;;) {
300 ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
301 if (ret != -EIOCBRETRY)
302 break;
303 wait_on_retry_sync_kiocb(&kiocb);
304 }
305
306 if (-EIOCBQUEUED == ret)
307 ret = wait_on_sync_kiocb(&kiocb);
filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos); is the core of the implementation; the function pointer points to ext3_file_write.
The purpose of line 307 is to wait for the IO to complete. Note that "IO completion" here only means the request has entered the IO queue, not that the physical IO has finished.
generic_file_aio_write(iocb, iov, nr_segs, pos);
__generic_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
generic_file_buffered_write(iocb, iov, nr_segs, pos,ppos,count,written);
generic_file_direct_IO(WRITE, iocb, iov, pos, *nr_segs);
The call sequence below this point is still very long and I cannot digest it all yet; the above is just for my own reference.