epoll, an interview staple: understanding it thoroughly from the kernel source

Posted on October 14 by 囍孤女

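Overview of epoll

epoll is one of the I/O multiplexing mechanisms in Linux. I/O multiplexing lets a single process monitor many file descriptors and be notified as soon as one of them becomes ready (usually readable or writable), so that the program can perform the corresponding read or write. epoll is not the only such mechanism in Linux (select and poll are the others), but this article covers only the kernel implementation of epoll.

There are plenty of introductions to the epoll interface online. The interface is not the focus here, but it is worth a quick review. It is very simple, just three functions. The description below is adapted from one of those introductions.

int epoll_create(int size);

Creates an epoll handle. size was meant to tell the kernel how many descriptors will be monitored. It is not like the first parameter of select(), which is the highest monitored fd plus one. Note that the epoll handle itself occupies an fd. On Linux you can see it under /proc/<pid>/fd. So when you are done with epoll you must close() it, otherwise the process may run out of fds.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

The event registration function. select() is told what to watch every time it is called; epoll instead registers the event types to watch up front. The first parameter is the return value of epoll_create(). The second is the action, one of three macros:

EPOLL_CTL_ADD: register a new fd with epfd;
EPOLL_CTL_MOD: change the events monitored for an fd that is already registered;
EPOLL_CTL_DEL: remove an fd from epfd.

The third parameter is the fd to monitor, and the fourth tells the kernel what to monitor. struct epoll_event is defined as follows:

    struct epoll_event {
        __uint32_t events;  /* Epoll events */
        epoll_data_t data;  /* User data variable */
    };

events is a bitwise OR of the following macros:

EPOLLIN: the file descriptor is readable (this includes an orderly shutdown by the peer of a socket);
EPOLLOUT: the file descriptor is writable;
EPOLLPRI: urgent data is available to read (that is, out-of-band data has arrived);
EPOLLERR: an error occurred on the file descriptor;
EPOLLHUP: the file descriptor was hung up;
EPOLLET: use edge-triggered mode for this descriptor, as opposed to the default level-triggered mode;
EPOLLONESHOT: report only one event. After that event has been delivered the descriptor stays registered but disabled, and it has to be re-armed with EPOLL_CTL_MOD to be monitored again.

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Waits for events, much like a call to select(). events receives the set of events from the kernel, and maxevents tells the kernel how many entries events can hold. Older descriptions say that maxevents must not exceed the size passed to epoll_create(), but in the 4.1.2 kernel size has no real use, and the kernel only requires maxevents to be greater than 0. timeout is in milliseconds: 0 returns immediately, and a negative value blocks indefinitely. The function returns the number of events to handle; 0 means the call timed out.
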
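The three calls described above can be put together in a few lines. The sketch below is only an illustration, not code from the article: count_ready and MAX_PIPES are hypothetical names, and pipes stand in for sockets so that the example needs no network. It registers the read ends of several pipes, makes some of them readable, and returns what one non-blocking epoll_wait() reports.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_PIPES 64   /* illustrative upper bound, not an epoll limit */

/*
 * Hypothetical helper: registers the read ends of `npipes` pipes with one
 * epoll instance, writes one byte into the first `nactive` of them, and
 * returns the number of events reported by a single epoll_wait() call
 * (-1 on any error).
 */
int count_ready(int npipes, int nactive)
{
    int pipes[MAX_PIPES][2];
    struct epoll_event ev, ready[MAX_PIPES];
    int epfd, created = 0, n = -1, i;

    if (npipes < 1 || npipes > MAX_PIPES || nactive < 0 || nactive > npipes)
        return -1;

    epfd = epoll_create(1);              /* size only has to be > 0 */
    if (epfd < 0)
        return -1;

    for (i = 0; i < npipes; i++) {
        if (pipe(pipes[i]) < 0)
            goto out;
        created++;
        ev.events = EPOLLIN;             /* watch for readability */
        ev.data.fd = pipes[i][0];        /* handed back by epoll_wait() */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[i][0], &ev) < 0)
            goto out;
    }

    for (i = 0; i < nactive; i++)
        if (write(pipes[i][1], "x", 1) != 1)
            goto out;

    /* timeout 0: report whatever is ready right now and return */
    n = epoll_wait(epfd, ready, MAX_PIPES, 0);
out:
    for (i = 0; i < created; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    close(epfd);                         /* the epoll handle is an fd too */
    return n;
}
```

Calling count_ready(64, 3) registers 64 descriptors, but epoll_wait() hands back only the 3 that actually have data.
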
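Advantages of epoll over select/poll

Every call to select or poll has to pass in all the fds to be monitored, which means the fd list is copied from user space to the kernel on every call; with many fds this is inefficient. A call to epoll_wait (the counterpart of calling select or poll) does not pass the fd list again, because epoll_ctl has already told the kernel which fds to watch, and epoll_ctl works incrementally instead of copying all fds every time. Once epoll_create has been called, the kernel has a data structure ready in kernel space to hold the monitored fds, and each epoll_ctl call simply maintains that structure.

A fatal weakness of select and poll shows up when you hold a very large set of sockets of which, because of network latency, only some are active at any moment: each call still scans the whole set linearly, so efficiency drops linearly with the size of the set. epoll does not have this problem. It only touches active sockets, because its kernel implementation is driven by a callback attached to each fd.

Even after a million fds have been added with epoll_ctl, epoll_wait still returns quickly and hands back just the fds that have events. This is because epoll_create, besides creating a file for the epoll instance, sets up a red-black tree that stores the fds later passed in through epoll_ctl and a list that holds ready events. epoll_wait only has to check whether this list contains anything. If it does, epoll_wait returns; if not, it sleeps, and it returns when the timeout expires even if the list is still empty. That is why epoll_wait is so efficient. Moreover, even when a huge number of fds is monitored, usually only a few are ready on each call, so epoll_wait copies only a few events from kernel space to user space.

How is the ready list maintained? When epoll_ctl runs, besides putting the fd into the red-black tree of the epoll instance, it registers a callback on the wait queue of the target file, telling the kernel to put the fd on the ready list when the file signals an event. So when data arrives on an fd (a socket, for example), once the kernel has copied the data from the device (the network card, for example) into kernel memory, the callback links the fd into the ready list.

Source code analysis

The epoll kernel code lives in fs/eventpoll.c. The sections below walk through the kernel implementation of epoll_create, epoll_ctl and epoll_wait in turn. The kernel source analyzed is Linux 4.1.2.

epoll_create
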
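epoll_create creates an epoll handle. Its system call implementation in the kernel is:

sys_epoll_create:

    SYSCALL_DEFINE1(epoll_create, int, size)
    {
        if (size <= 0)
            return -EINVAL;

        return sys_epoll_create1(0);
    }

As you can see, the size argument passed to epoll_create is only checked for being less than or equal to 0 and is never used again.
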
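The size check is easy to observe from user space. The following sketch assumes nothing beyond the documented epoll_create(2) interface; create_errno is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if epoll_create(size) succeeds,
 * otherwise the errno value it failed with.
 */
int create_errno(int size)
{
    int epfd = epoll_create(size);

    if (epfd < 0)
        return errno;
    close(epfd);
    return 0;
}
```

create_errno(0) and create_errno(-1) yield EINVAL, while any positive size, small or large, succeeds in the same way.
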
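The whole function is three lines long; the real work is done in sys_epoll_create1.

sys_epoll_create -> sys_epoll_create1:

    /*
     * Open an eventpoll file descriptor.
     */
    SYSCALL_DEFINE1(epoll_create1, int, flags)
    {
        int error, fd;
        struct eventpoll *ep = NULL;
        struct file *file;

        /* Check the EPOLL_* constant for consistency.  */
        BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

        if (flags & ~EPOLL_CLOEXEC)
            return -EINVAL;
        /*
         * Create the internal data structure ("struct eventpoll").
         */
        error = ep_alloc(&ep);
        if (error < 0)
            return error;
        /*
         * Creates all the items needed to setup an eventpoll file. That is,
         * a file structure and a free file descriptor.
         */
        fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
        if (fd < 0) {
            error = fd;
            goto out_free_ep;
        }
        file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                                  O_RDWR | (flags & O_CLOEXEC));
        if (IS_ERR(file)) {
            error = PTR_ERR(file);
            goto out_free_fd;
        }
        ep->file = file;
        fd_install(fd, file);
        return fd;

    out_free_fd:
        put_unused_fd(fd);
    out_free_ep:
        ep_free(ep);
        return error;
    }

sys_epoll_create1 proceeds as follows. It first calls ep_alloc to allocate an eventpoll structure and initialize its members. There is not much to say about this step; the code is:

sys_epoll_create -> sys_epoll_create1 -> ep_alloc:

    static int ep_alloc(struct eventpoll **pep)
    {
        int error;
        struct user_struct *user;
        struct eventpoll *ep;

        user = get_current_user();
        error = -ENOMEM;
        ep = kzalloc(sizeof(*ep), GFP_KERNEL);
        if (unlikely(!ep))
            goto free_uid;

        spin_lock_init(&ep->lock);
        mutex_init(&ep->mtx);
        init_waitqueue_head(&ep->wq);
        init_waitqueue_head(&ep->poll_wait);
        INIT_LIST_HEAD(&ep->rdllist);
        ep->rbr = RB_ROOT;
        ep->ovflist = EP_UNACTIVE_PTR;
        ep->user = user;

        *pep = ep;

        return 0;

    free_uid:
        free_uid(user);
        return error;
    }

Next it calls get_unused_fd_flags to allocate an unused file descriptor in the current process.

sys_epoll_create -> sys_epoll_create1 -> get_unused_fd_flags:

    int get_unused_fd_flags(unsigned flags)
    {
        return __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
    }

In the Linux kernel, current is a macro that yields a pointer to the task_struct (the process descriptor) of the currently running process. The files a process has open are tracked through the files member of its process descriptor, so current->files is the open-file table of the current process. rlimit(RLIMIT_NOFILE) returns the limit on the number of file descriptors the process may open. The value is configurable, and the default is typically 1024.
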
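__alloc_fd allocates a free file descriptor for the process in the range [start, end). Here start is 0 and end is the maximum number of file descriptors the process may open. We will not dig any deeper; the code is:

sys_epoll_create -> sys_epoll_create1 -> get_unused_fd_flags -> __alloc_fd:

    /*
     * allocate a file descriptor, mark it busy.
     */
    int __alloc_fd(struct files_struct *files,
                   unsigned start, unsigned end, unsigned flags)
    {
        unsigned int fd;
        int error;
        struct fdtable *fdt;

        spin_lock(&files->file_lock);
    repeat:
        fdt = files_fdtable(files);
        fd = start;
        if (fd < files->next_fd)
            fd = files->next_fd;

        if (fd < fdt->max_fds)
            fd = find_next_fd(fdt, fd);

        /*
         * N.B. For clone tasks sharing a files structure, this test
         * will limit the total number of files that can be opened.
         */
        error = -EMFILE;
        if (fd >= end)
            goto out;

        error = expand_files(files, fd);
        if (error < 0)
            goto out;

        /*
         * If we needed to expand the fs array we
         * might have blocked - try again.
         */
        if (error)
            goto repeat;

        if (start <= files->next_fd)
            files->next_fd = fd + 1;

        __set_open_fd(fd, fdt);
        if (flags & O_CLOEXEC)
            __set_close_on_exec(fd, fdt);
        else
            __clear_close_on_exec(fd, fdt);
        error = fd;
    #if 1
        /* Sanity check */
        if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
            printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
            rcu_assign_pointer(fdt->fd[fd], NULL);
        }
    #endif

    out:
        spin_unlock(&files->file_lock);
        return error;
    }
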
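Two consequences of this allocation scheme are visible from user space: the upper bound comes from RLIMIT_NOFILE, and the search starts from the lowest free slot. The sketch below illustrates both under the assumption of a single-threaded process (no other thread opening or closing descriptors in between); the function names are hypothetical.

```c
#include <sys/epoll.h>
#include <sys/resource.h>
#include <unistd.h>

/* Returns the soft RLIMIT_NOFILE value (the `end` bound), or 0 on error. */
unsigned long long nofile_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
        return 0;
    return (unsigned long long)rl.rlim_cur;
}

/*
 * Returns 1 if an epoll descriptor created right after closing another
 * descriptor reuses that descriptor's number, 0 if not, -1 on error.
 */
int epoll_fd_reuses_lowest_slot(void)
{
    int p[2], freed, epfd, same;

    if (pipe(p) < 0)
        return -1;
    freed = p[0];             /* lowest free slot at the time of pipe() */
    close(p[0]);              /* ...and free again now */

    epfd = epoll_create1(0);  /* goes through the fd allocator as well */
    same = (epfd == freed);

    if (epfd >= 0)
        close(epfd);
    close(p[1]);
    return same;
}
```
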
Then epoll_create1 calls anon_inode_getfile to create a file structure:

sys_epoll_create -> sys_epoll_create1 -> anon_inode_getfile:

    /**
     * anon_inode_getfile - creates a new file instance by hooking it up to an
     *                      anonymous inode, and a dentry that describe the "class"
     *                      of the file
     *
     * @name:    [in]    name of the "class" of the new file
     * @fops:    [in]    file operations for the new file
     * @priv:    [in]    private data for the new file (will be file's private_data)
     * @flags:   [in]    flags
     *
     * Creates a new file by hooking it on a single inode. This is useful for files
     * that do not need to have a full-fledged inode in order to operate correctly.
     * All the files created with anon_inode_getfile() will share a single inode,
     * hence saving memory and avoiding code duplication for the file/inode/dentry
     * setup.  Returns the newly created file* or an error pointer.
     */
    struct file *anon_inode_getfile(const char *name,
                                    const struct file_operations *fops,
                                    void *priv, int flags)
    {
        struct qstr this;
        struct path path;
        struct file *file;

        if (IS_ERR(anon_inode_inode))
            return ERR_PTR(-ENODEV);

        if (fops->owner && !try_module_get(fops->owner))
            return ERR_PTR(-ENOENT);

        /*
         * Link the inode to a directory entry by creating a unique name
         * using the inode sequence number.
         */
        file = ERR_PTR(-ENOMEM);
        this.name = name;
        this.len = strlen(name);
        this.hash = 0;
        path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
        if (!path.dentry)
            goto err_module;

        path.mnt = mntget(anon_inode_mnt);
        /*
         * We know the anon_inode inode count is always greater than zero,
         * so ihold() is safe.
         */
        ihold(anon_inode_inode);

        d_instantiate(path.dentry, anon_inode_inode);

        file = alloc_file(&path, OPEN_FMODE(flags), fops);
        if (IS_ERR(file))
            goto err_dput;
        file->f_mapping = anon_inode_inode->i_mapping;

        file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
        file->private_data = priv;

        return file;

    err_dput:
        path_put(&path);
    err_module:
        module_put(fops->owner);
        return file;
    }

anon_inode_getfile allocates a dentry structure and a file structure and hooks the file up to the anonymous inode anon_inode_inode. Note that the ep variable, the eventpoll structure allocated earlier, is passed in when the file is requested, so file->private_data of the new file points to ep. After anon_inode_getfile returns, ep->file in turn is set to point to the new file structure.

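The "[eventpoll]" class name passed to anon_inode_getfile can be seen from user space, because procfs shows it as the target of the descriptor's symlink. The sketch below assumes that /proc is mounted and that the link text has the form anon_inode:[eventpoll]; the helper name is hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns 1 if /proc/self/fd/<epfd> points at "anon_inode:[eventpoll]",
 * 0 if it points somewhere else, -1 if the link could not be read
 * (for example because /proc is not mounted).
 */
int epoll_fd_is_anon_inode(void)
{
    char path[64], target[64];
    ssize_t len;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", epfd);
    len = readlink(path, target, sizeof(target) - 1);
    close(epfd);
    if (len < 0)
        return -1;

    target[len] = '\0';       /* readlink() does not terminate the string */
    return strcmp(target, "anon_inode:[eventpoll]") == 0;
}
```
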
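A brief word on file, dentry and inode. When a process opens a file, the kernel allocates a file structure that represents the open file in the context of that process. The application then refers to this structure through an int file descriptor. In effect, the kernel keeps an array of file structure pointers for each process, and the file descriptor is the index of the corresponding file structure in that array.

A dentry (directory entry) records a file's name and its place in the directory tree. A process can open the same file several times, and several processes can open the same file; in all these cases the kernel allocates a separate file structure, that is, a separate open-file context, each time. However many times a given file name is opened, though, the kernel keeps only one dentry for it. So the relationship between file and dentry is many-to-one.

Besides a dentry, every file also has an inode (index node) structure, which records attributes such as where and how the file is laid out on the storage medium. The kernel keeps only one inode per file. A dentry and an inode describe different things. One file may have several names (hard links, for example), and each name has its own dentry. A dentry represents the file in the logical sense and records its logical attributes, while an inode represents the file in the physical sense and records its physical attributes. The relationship between dentry and inode is therefore many-to-one as well.

Finally, epoll_create1 calls fd_install to associate fd with file. From then on the kernel can reach the file structure through the fd that the application passes in. This code is fairly simple, so we will not go any deeper.
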
sys_epoll_create -> sys_epoll_create1 -> fd_install:

    /*
     * Install a file pointer in the fd array.
     *
     * The VFS is full of places where we drop the files lock between
     * setting the open_fds bitmap and installing the file in the file
     * array.  At any such point, we are vulnerable to a dup2() race
     * installing a file in the array before us.  We need to detect this and
     * fput() the struct file we are about to overwrite in this case.
     *
     * It should never happen - if we allow dup2() do it, _really_ bad things
     * will follow.
     *
     * NOTE: __fd_install() variant is really, really low-level; don't
     * use it unless you are forced to by truly lousy API shoved down
     * your throat.  'files' *MUST* be either current->files or obtained
     * by get_files_struct(current) done by whoever had given it to you,
     * or really bad things will happen.  Normally you want to use
     * fd_install() instead.
     */
    void __fd_install(struct files_struct *files, unsigned int fd,
                      struct file *file)
    {
        struct fdtable *fdt;

        might_sleep();
        rcu_read_lock_sched();

        while (unlikely(files->resize_in_progress)) {
            rcu_read_unlock_sched();
            wait_event(files->resize_wait, !files->resize_in_progress);
            rcu_read_lock_sched();
        }
        /* coupled with smp_wmb() in expand_fdtable() */
        smp_rmb();
        fdt = rcu_dereference_sched(files->fdt);
        BUG_ON(fdt->fd[fd] != NULL);
        rcu_assign_pointer(fdt->fd[fd], file);
        rcu_read_unlock_sched();
    }

    void fd_install(unsigned int fd, struct file *file)
    {
        __fd_install(current->files, fd, file);
    }

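To summarize what epoll_create does: the call allocates, inside the kernel, an eventpoll structure and a file structure that represents the epoll file, links the two together, and returns an epoll file descriptor fd that is associated with that file structure. Whenever the application operates on epoll it passes in the epoll fd. The kernel uses the fd to find the file structure of the epoll file and, through the file, the eventpoll structure that epoll_create allocated. All the important epoll state is stored in that structure, and every subsequent epoll interface call operates on it.

So the job of epoll_create is to build, for the process, a path inside the kernel from the epoll file descriptor to the eventpoll structure.

epoll_ctl
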
epoll_ctl adds, modifies or removes the events monitored on a file. The kernel code is:

sys_epoll_ctl:

    /*
     * The following function implements the controller interface for
     * the eventpoll file that enables the insertion/removal/change of
     * file descriptors inside the interest set.
     */
    SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
                    struct epoll_event __user *, event)
    {
        int error;
        int full_check = 0;
        struct fd f, tf;
        struct eventpoll *ep;
        struct epitem *epi;
        struct epoll_event epds;
        struct eventpoll *tep = NULL;

        error = -EFAULT;
        if (ep_op_has_event(op) &&
            copy_from_user(&epds, event, sizeof(struct epoll_event)))
            goto error_return;

        error = -EBADF;
        f = fdget(epfd);
        if (!f.file)
            goto error_return;

        /* Get the "struct file *" for the target file */
        tf = fdget(fd);
        if (!tf.file)
            goto error_fput;

        /* The target file descriptor must support poll */
        error = -EPERM;
        if (!tf.file->f_op->poll)
            goto error_tgt_fput;

        /* Check if EPOLLWAKEUP is allowed */
        if (ep_op_has_event(op))
            ep_take_care_of_epollwakeup(&epds);

        /*
         * We have to check that the file structure underneath the file descriptor
         * the user passed to us _is_ an eventpoll file. And also we do not permit
         * adding an epoll file descriptor inside itself.
         */
        error = -EINVAL;
        if (f.file == tf.file || !is_file_epoll(f.file))
            goto error_tgt_fput;

        /*
         * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
         * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
         * Also, we do not currently supported nested exclusive wakeups.
         */
        if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) {
            if (op == EPOLL_CTL_MOD)
                goto error_tgt_fput;
            if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
                    (epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
                goto error_tgt_fput;
        }

        /*
         * At this point it is safe to assume that the "private_data" contains
         * our own data structure.
         */
        ep = f.file->private_data;

        /*
         * When we insert an epoll file descriptor, inside another epoll file
         * descriptor, there is the change of creating closed loops, which are
         * better be handled here, than in more critical paths. While we are
         * checking for loops we also determine the list of files reachable
         * and hang them on the tfile_check_list, so we can check that we
         * haven't created too many possible wakeup paths.
         *
         * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when
         * the epoll file descriptor is attaching directly to a wakeup source,
         * unless the epoll file descriptor is nested. The purpose of taking the
         * 'epmutex' on add is to prevent complex toplogies such as loops and
         * deep wakeup paths from forming in parallel through multiple
         * EPOLL_CTL_ADD operations.
         */
        mutex_lock_nested(&ep->mtx, 0);
        if (op == EPOLL_CTL_ADD) {
            if (!list_empty(&f.file->f_ep_links) ||
                            is_file_epoll(tf.file)) {
                full_check = 1;
                mutex_unlock(&ep->mtx);
                mutex_lock(&epmutex);
                if (is_file_epoll(tf.file)) {
                    error = -ELOOP;
                    if (ep_loop_check(ep, tf.file) != 0) {
                        clear_tfile_check_list();
                        goto error_tgt_fput;
                    }
                } else
                    list_add(&tf.file->f_tfile_llink,
                             &tfile_check_list);
                mutex_lock_nested(&ep->mtx, 0);
                if (is_file_epoll(tf.file)) {
                    tep = tf.file->private_data;
                    mutex_lock_nested(&tep->mtx, 1);
                }
            }
        }

        /*
         * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
         * above, we can be sure to be able to use the item looked up by
         * ep_find() till we release the mutex.
         */
        epi = ep_find(ep, tf.file, fd);

        error = -EINVAL;
        switch (op) {
        case EPOLL_CTL_ADD:
            if (!epi) {
                epds.events |= POLLERR | POLLHUP;
                error = ep_insert(ep, &epds, tf.file, fd, full_check);
            } else
                error = -EEXIST;
            if (full_check)
                clear_tfile_check_list();
            break;
        case EPOLL_CTL_DEL:
            if (epi)
                error = ep_remove(ep, epi);
            else
                error = -ENOENT;
            break;
        case EPOLL_CTL_MOD:
            if (epi) {
                if (!(epi->event.events & EPOLLEXCLUSIVE)) {
                    epds.events |= POLLERR | POLLHUP;
                    error = ep_modify(ep, epi, &epds);
                }
            } else
                error = -ENOENT;
            break;
        }
        if (tep != NULL)
            mutex_unlock(&tep->mtx);
        mutex_unlock(&ep->mtx);

    error_tgt_fput:
        if (full_check)
            mutex_unlock(&epmutex);

        fdput(tf);
    error_fput:
        fdput(f);
    error_return:

        return error;
    }

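As described in the interface overview, op is the action to perform on epoll (add, modify or delete an event). ep_op_has_event(op) tests whether the operation is anything other than a delete: if op != EPOLL_CTL_DEL is true, copy_from_user is called to copy the event passed in from user space into the kernel variable epds. Delete is the only operation for which the kernel does not need the event supplied by the process.

Next, two consecutive fdget calls obtain the file structures of the epoll file and of the monitored file (called the target file from here on). Note that fdget returns a struct fd, which contains the file pointer.

Then come some argument checks. In each of the following cases the arguments are considered invalid and the call returns an error immediately:

the target file does not support poll (!tf.file->f_op->poll);
the target file is the epoll file itself (f.file == tf.file);
the epoll file passed in by the user (the file behind epfd) is not really an epoll file (!is_file_epoll(f.file));
the operation is a modify and the event type includes EPOLLEXCLUSIVE;
and so on.

Below these there are a few more checks that apply only to the add operation. They are fairly simple and are left for the reader.
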
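Most of the error paths in the epoll_ctl listing can be triggered deliberately from user space. The following sketch is illustrative only: ctl_errno, struct ctl_results and probe_epoll_ctl are hypothetical names, a pipe is used as the target file, and the current directory stands in for a file without poll support (epoll_ctl(2) documents EPERM for regular files and directories).

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: runs one epoll_ctl() call, returns 0 or errno. */
int ctl_errno(int epfd, int op, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN };

    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

struct ctl_results {
    int self_add, dir_add, bad_fd, not_epoll, del_missing, first_add, dup_add;
};

/* Collects the result of each case; fields stay -1 if setup failed. */
struct ctl_results probe_epoll_ctl(void)
{
    struct ctl_results r = { -1, -1, -1, -1, -1, -1, -1 };
    int p[2];
    int epfd = epoll_create1(0);
    int dirfd = open(".", O_RDONLY);

    if (epfd >= 0 && dirfd >= 0 && pipe(p) == 0) {
        r.self_add    = ctl_errno(epfd, EPOLL_CTL_ADD, epfd);  /* epoll fd into itself */
        r.dir_add     = ctl_errno(epfd, EPOLL_CTL_ADD, dirfd); /* target cannot poll */
        r.bad_fd      = ctl_errno(epfd, EPOLL_CTL_ADD, -1);    /* fdget() fails */
        r.not_epoll   = ctl_errno(p[1], EPOLL_CTL_ADD, p[0]);  /* epfd is a pipe */
        r.del_missing = ctl_errno(epfd, EPOLL_CTL_DEL, p[0]);  /* not in the tree */
        r.first_add   = ctl_errno(epfd, EPOLL_CTL_ADD, p[0]);  /* inserted */
        r.dup_add     = ctl_errno(epfd, EPOLL_CTL_ADD, p[0]);  /* already in the tree */
        close(p[0]);
        close(p[1]);
    }
    if (dirfd >= 0)
        close(dirfd);
    if (epfd >= 0)
        close(epfd);
    return r;
}
```

On a typical Linux system the fields come back as EINVAL, EPERM, EBADF, EINVAL, ENOENT, 0 and EEXIST respectively, matching the order of the checks in the listing.
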
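ep maintains a red-black tree. Every time an event is registered, an epitem structure is allocated to represent the monitored item and is inserted into this tree. epoll_ctl calls ep_find to look up the item for the target file in the tree; the result may be NULL.

The switch block that follows is the core of epoll_ctl. The switch on op has three cases: add (EPOLL_CTL_ADD), delete (EPOLL_CTL_DEL) and modify (EPOLL_CTL_MOD). The add case is explained here. The other two are similar, and once adding an event is understood, deleting and modifying follow by analogy.

When adding an event for a target file, the kernel first has to make sure that ep is not already monitoring that file. If it is (epi is not NULL), EEXIST is returned. Otherwise the arguments are fine: POLLERR and POLLHUP are added to the monitored events by default, and ep_insert is called to insert the item for the target file into the red-black tree maintained by ep:
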
sys_epoll_ctl -> ep_insert:

    /*
     * Must be called with "mtx" held.
     */
    static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
                         struct file *tfile, int fd, int full_check)
    {
        int error, revents, pwake = 0;
        unsigned long flags;
        long user_watches;
        struct epitem *epi;
        struct ep_pqueue epq;

        user_watches = atomic_long_read(&ep->user->epoll_watches);
        if (unlikely(user_watches >= max_user_watches))
            return -ENOSPC;
        if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
            return -ENOMEM;

        /* Item initialization follow here ... */
        INIT_LIST_HEAD(&epi->rdllink);
        INIT_LIST_HEAD(&epi->fllink);
        INIT_LIST_HEAD(&epi->pwqlist);
        epi->ep = ep;
        ep_set_ffd(&epi->ffd, tfile, fd);
        epi->event = *event;
        epi->nwait = 0;
        epi->next = EP_UNACTIVE_PTR;
        if (epi->event.events & EPOLLWAKEUP) {
            error = ep_create_wakeup_source(epi);
            if (error)
                goto error_create_wakeup_source;
        } else {
            RCU_INIT_POINTER(epi->ws, NULL);
        }

        /* Initialize the poll table using the queue callback */
        epq.epi = epi;
        init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

        /*
         * Attach the item to the poll hooks and get current event bits.
         * We can safely use the file* here because its usage count has
         * been increased by the caller of this function. Note that after
         * this operation completes, the poll callback can start hitting
         * the new item.
         */
        revents = ep_item_poll(epi, &epq.pt);

        /*
         * We have to check if something went wrong during the poll wait queue
         * install process. Namely an allocation for a wait queue failed due
         * high memory pressure.
         */
        error = -ENOMEM;
        if (epi->nwait < 0)
            goto error_unregister;

        /* Add the current item to the list of active epoll hook for this file */
        spin_lock(&tfile->f_lock);
        list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
        spin_unlock(&tfile->f_lock);

        /*
         * Add the current item to the RB tree. All RB tree operations are
         * protected by "mtx", and ep_insert() is called with "mtx" held.
         */
        ep_rbtree_insert(ep, epi);

        /* now check if we've created too many backpaths */
        error = -EINVAL;
        if (full_check && reverse_path_check())
            goto error_remove_epi;

        /* We have to drop the new item inside our item list to keep track of it */
        spin_lock_irqsave(&ep->lock, flags);

        /* record NAPI ID of new item if present */
        ep_set_busy_poll_napi_id(epi);

        /* If the file is already "ready" we drop it inside the ready list */
        if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist);
            ep_pm_stay_awake(epi);

            /* Notify waiting tasks that events are available */
            if (waitqueue_active(&ep->wq))
                wake_up_locked(&ep->wq);
            if (waitqueue_active(&ep->poll_wait))
                pwake++;
        }

        spin_unlock_irqrestore(&ep->lock, flags);

        atomic_long_inc(&ep->user->epoll_watches);

        /* We have to call this outside the lock */
        if (pwake)
            ep_poll_safewake(&ep->poll_wait);

        return 0;

    error_remove_epi:
        spin_lock(&tfile->f_lock);
        list_del_rcu(&epi->fllink);
        spin_unlock(&tfile->f_lock);

        rb_erase(&epi->rbn, &ep->rbr);

    error_unregister:
        ep_unregister_pollwait(ep, epi);

        /*
         * We need to do this because an event could have been arrived on some
         * allocated wait queue. Note that we don't care about the ep->ovflist
         * list, since that is used/cleaned only inside a section bound by "mtx".
         * And ep_insert() is called with "mtx" held.
         */
        spin_lock_irqsave(&ep->lock, flags);
        if (ep_is_linked(&epi->rdllink))
            list_del_init(&epi->rdllink);
        spin_unlock_irqrestore(&ep->lock, flags);

        wakeup_source_unregister(ep_wakeup_source(epi));

    error_create_wakeup_source:
        kmem_cache_free(epi_cache, epi);

        return error;
    }

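As mentioned earlier, the monitoring of a target file is tracked by an epitem structure. So ep_insert first calls kmem_cache_alloc to allocate an epitem from the slab allocator and then initializes it; there is nothing special about this part. Next, look at the call to ep_item_poll:

sys_epoll_ctl -> ep_insert -> ep_item_poll:

    static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
    {
        pt->_key = epi->event.events;

        return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
    }

ep_item_poll calls the poll function of the target file. Which function that is depends on the kind of target file: for a socket, poll points to sock_poll, which for a TCP socket ends up in tcp_poll. Whatever function poll points to, its job is always the same: it returns the event bits currently pending on the target file and hooks the item into the target file's poll mechanism (the key point being that the ep_ptable_queue_proc callback gets to run). Once this step is done, events on the target file will reach epoll through the callback installed by ep_ptable_queue_proc.

Next, list_add_tail_rcu adds the item to the f_ep_links list of the target file. This list is the target file's list of epoll hooks, and every item that monitors the target file is linked into it.

Then ep_rbtree_insert adds the item epi to the red-black tree maintained by ep. The code needs no further explanation:

sys_epoll_ctl -> ep_insert -> ep_rbtree_insert:

    static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
    {
        int kcmp;
        struct rb_node **p = &ep->rbr.rb_node, *parent = NULL;
        struct epitem *epic;

        while (*p) {
            parent = *p;
            epic = rb_entry(parent, struct epitem, rbn);
            kcmp = ep_cmp_ffd(&epi->ffd, &epic->ffd);
            if (kcmp > 0)
                p = &parent->rb_right;
            else
                p = &parent->rb_left;
        }
        rb_link_node(&epi->rbn, parent, p);
        rb_insert_color(&epi->rbn, &ep->rbr);
    }
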
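The tree is ordered by ep_cmp_ffd, which is not shown in the article; the sketch below assumes that, as in the mainline fs/eventpoll.c, it compares the file pointer first and the fd number second. Under that assumption a descriptor created with dup() shares the file but has a different key, so it can be registered next to the original. This matches the behaviour documented in epoll(7). The function name is hypothetical.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns 1 if registering the same fd twice fails with EEXIST while a
 * dup() of it can be registered as a separate item, 0 otherwise,
 * -1 on setup failure.
 */
int dup_is_separate_key(void)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int p[2], epfd, copy, ok = 0;

    epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    if (pipe(p) < 0) {
        close(epfd);
        return -1;
    }
    copy = dup(p[0]);   /* same open file, different fd number */

    if (copy >= 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1 && errno == EEXIST &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, copy, &ev) == 0)
        ok = 1;

    if (copy >= 0)
        close(copy);
    close(p[0]);
    close(p[1]);
    close(epfd);
    return ok;
}
```
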
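As mentioned above, ep_insert calls ep_item_poll to fetch the event bits pending on the target file. Events the process is interested in may already have occurred before epoll_ctl was called. If such an event is pending (revents & event->events is true) and the item for the target file is not yet linked into ep's ready list rdllist, the item is added to rdllist. rdllist links the items of all target files monitored by this epoll descriptor that are currently ready. In addition, if tasks are waiting for events, wake_up_locked is called to wake the tasks waiting on ep->wq so that they can handle the events. A process that calls epoll_wait shows up in ep's wq wait queue, and epoll_wait is the next function to look at.
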
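The effect of this "already ready at registration time" path can be checked from user space. In the sketch below (ready_before_add is a hypothetical name, and a pipe stands in for the target file) the data is written before EPOLL_CTL_ADD, so no event arrives after registration, yet a non-blocking epoll_wait() still reports the descriptor.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Writes to a pipe first, registers its read end afterwards, and returns
 * the result of a non-blocking epoll_wait() (-1 on error).
 */
int ready_before_add(void)
{
    struct epoll_event ev = { .events = EPOLLIN }, out;
    int p[2], epfd, n = -1;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create1(0);

    if (epfd >= 0 &&
        write(p[1], "x", 1) == 1 &&                        /* event happens first */
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0)    /* registration second */
        n = epoll_wait(epfd, &out, 1, 0);

    if (epfd >= 0)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return n;
}
```
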
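To summarize epoll_ctl: based on the events to monitor, it allocates an item for the target file and links that item into the red-black tree of the eventpoll structure.

epoll_wait

epoll_wait waits for events to occur. The kernel code is:

sys_epoll_wait:

    /*
     * Implement the event wait interface for the eventpoll file. It is the kernel
     * part of the user space epoll_wait(2).
     */
    SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
                    int, maxevents, int, timeout)
    {
        int error;
        struct fd f;
        struct eventpoll *ep;

        /* The maximum number of event must be greater than zero */
        if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
            return -EINVAL;

        /* Verify that the area passed by the user is writeable */
        if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
            return -EFAULT;

        /* Get the "struct file *" for the eventpoll file */
        f = fdget(epfd);
        if (!f.file)
            return -EBADF;

        /*
         * We have to check that the file structure underneath the fd
         * the user passed to us _is_ an eventpoll file.
         */
        error = -EINVAL;
        if (!is_file_epoll(f.file))
            goto error_fput;

        /*
         * At this point it is safe to assume that the "private_data" contains
         * our own data structure.
         */
        ep = f.file->private_data;

        /* Time to fish for events ... */
        error = ep_poll(ep, events, maxevents, timeout);

    error_fput:
        fdput(f);
        return error;
    }
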
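First the arguments passed in by the process are checked. maxevents must be greater than 0 and no greater than EP_MAX_EVENTS, otherwise EINVAL is returned. The events buffer must be user memory that can be written to, otherwise EFAULT is returned. epfd must refer to an open file, otherwise EBADF is returned, and that file must really be an epoll file, otherwise EINVAL is returned.
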
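These argument checks map directly onto errno values that a caller sees. The sketch below is illustrative; struct wait_results and probe_epoll_wait are hypothetical names, and a pipe serves as an example of a descriptor that is valid but not an epoll file.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

struct wait_results {
    int zero_max, bad_fd, not_epoll, valid;
};

/* Hypothetical helper: one non-blocking epoll_wait(), returns 0 or errno. */
static int wait_errno(int epfd, int maxevents)
{
    struct epoll_event buf[4];

    if (maxevents > 4)          /* never claim more room than buf has */
        maxevents = 4;
    return epoll_wait(epfd, buf, maxevents, 0) < 0 ? errno : 0;
}

/* Collects the result of each case; fields stay -1 if setup failed. */
struct wait_results probe_epoll_wait(void)
{
    struct wait_results r = { -1, -1, -1, -1 };
    int p[2];
    int epfd = epoll_create1(0);

    if (epfd >= 0 && pipe(p) == 0) {
        r.zero_max  = wait_errno(epfd, 0);   /* maxevents <= 0 */
        r.bad_fd    = wait_errno(-1, 1);     /* no file behind epfd */
        r.not_epoll = wait_errno(p[0], 1);   /* a pipe, not an epoll file */
        r.valid     = wait_errno(epfd, 1);   /* passes every check */
        close(p[0]);
        close(p[1]);
    }
    if (epfd >= 0)
        close(epfd);
    return r;
}
```
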
Once all arguments have passed the checks, ep_poll is called to do the real work:

sys_epoll_wait -> ep_poll:

    /**
     * ep_poll - Retrieves ready events, and delivers them to the caller supplied
     *           event buffer.
     *
     * @ep: Pointer to the eventpoll context.
     * @events: Pointer to the userspace buffer where the ready events should be
     *          stored.
     * @maxevents: Size (in terms of number of events) of the caller event buffer.
     * @timeout: Maximum timeout for the ready events fetch operation, in
     *           milliseconds. If the @timeout is zero, the function will not block,
     *           while if the @timeout is less than zero, the function will block
     *           until at least one event has been retrieved (or an error
     *           occurred).
     *
     * Returns: Returns the number of ready events which have been fetched, or an
     *          error code, in case of error.
     */
    static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                       int maxevents, long timeout)
    {
        int res = 0, eavail, timed_out = 0;
        unsigned long flags;
        u64 slack = 0;
        wait_queue_entry_t wait;
        ktime_t expires, *to = NULL;

        if (timeout > 0) {
            struct timespec64 end_time = ep_set_mstimeout(timeout);

            slack = select_estimate_accuracy(&end_time);
            to = &expires;
            *to = timespec64_to_ktime(end_time);
        } else if (timeout == 0) {
            /*
             * Avoid the unnecessary trip to the wait queue loop, if the
             * caller specified a non blocking operation.
             */
            timed_out = 1;
            spin_lock_irqsave(&ep->lock, flags);
            goto check_events;
        }

    fetch_events:

        if (!ep_events_available(ep))
            ep_busy_loop(ep, timed_out);

        spin_lock_irqsave(&ep->lock, flags);

        if (!ep_events_available(ep)) {
            /*
             * Busy poll timed out.  Drop NAPI ID for now, we can add
             * it back in when we have moved a socket with a valid NAPI
             * ID onto the ready list.
             */
            ep_reset_busy_poll_napi_id(ep);

            /*
             * We don't have any available event to return to the caller.
             * We need to sleep here, and we will be wake up by
             * ep_poll_callback() when events will become available.
             */
            init_waitqueue_entry(&wait, current);
            __add_wait_queue_exclusive(&ep->wq, &wait);

            for (;;) {
                /*
                 * We don't want to sleep if the ep_poll_callback() sends us
                 * a wakeup in between. That's why we set the task state
                 * to TASK_INTERRUPTIBLE before doing the checks.
                 */
                set_current_state(TASK_INTERRUPTIBLE);
                if (ep_events_available(ep) || timed_out)
                    break;
                if (signal_pending(current)) {
                    res = -EINTR;
                    break;
                }

                spin_unlock_irqrestore(&ep->lock, flags);
                if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
                    timed_out = 1;

                spin_lock_irqsave(&ep->lock, flags);
            }

            __remove_wait_queue(&ep->wq, &wait);
            __set_current_state(TASK_RUNNING);
        }
    check_events:
        /* Is it worth to try to dig for events ? */
        eavail = ep_events_available(ep);

        spin_unlock_irqrestore(&ep->lock, flags);

        /*
         * Try to transfer events to user space. In case we get 0 events and
         * there's still timeout left over, we go trying again in search of
         * more luck.
         */
        if (!res && eavail &&
            !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
            goto fetch_events;

        return res;
    }

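ep_poll starts by handling the wait time. timeout is given in milliseconds. If timeout is greater than 0, the call times out after waiting that long. If timeout equals 0, the function does not block and returns immediately. If timeout is less than 0, the call blocks indefinitely and returns only once an event occurs.
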
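When no event is pending (!ep_events_available(ep) is true), __add_wait_queue_exclusive adds the current process to the ep->wq wait queue. Then, inside an endless for loop, set_current_state(TASK_INTERRUPTIBLE) first puts the current process into the interruptible sleep state; the process then gives up the CPU and sleeps until it is woken, either by another context calling wake_up on the queue, by an incoming signal, or by the timeout expiring. Only then does it continue with the code that follows.

After being woken, the process first checks whether an event has occurred, whether the timeout has expired, or whether it was woken by a signal. In any of these cases it leaves the loop, removes itself from the ep->wq wait queue and sets its state back to TASK_RUNNING.

If an event really has occurred, ep_send_events is called to transfer the events to user space.
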
sys_epoll_wait -> ep_poll -> ep_send_events:

    static int ep_send_events(struct eventpoll *ep,
                              struct epoll_event __user *events, int maxevents)
    {
        struct ep_send_events_data esed;

        esed.maxevents = maxevents;
        esed.events = events;

        return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
    }

ep_send_events does very little itself; the real work happens in ep_scan_ready_list:

sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list:

    /**
     * ep_scan_ready_list - Scans the ready list in a way that makes possible for
     *                      the scan code, to call f_op->poll(). Also allows for
     *                      O(NumReady) performance.
     *
     * @ep: Pointer to the epoll private data structure.
     * @sproc: Pointer to the scan callback.
     * @priv: Private opaque data passed to the @sproc callback.
     * @depth: The current depth of recursive f_op->poll calls.
     * @ep_locked: caller already holds ep->mtx
     *
     * Returns: The same integer error code returned by the @sproc callback.
     */
    static int ep_scan_ready_list(struct eventpoll *ep,
                                  int (*sproc)(struct eventpoll *,
                                               struct list_head *, void *),
                                  void *priv, int depth, bool ep_locked)
    {
        int error, pwake = 0;
        unsigned long flags;
        struct epitem *epi, *nepi;
        LIST_HEAD(txlist);

        /*
         * We need to lock this because we could be hit by
         * eventpoll_release_file() and epoll_ctl().
         */

        if (!ep_locked)
            mutex_lock_nested(&ep->mtx, depth);

        /*
         * Steal the ready list, and re-init the original one to the
         * empty list. Also, set ep->ovflist to NULL so that events
         * happening while looping w/out locks, are not lost. We cannot
         * have the poll callback to queue directly on ep->rdllist,
         * because we want the "sproc" callback to be able to do it
         * in a lockless way.
         */
        spin_lock_irqsave(&ep->lock, flags);
        list_splice_init(&ep->rdllist, &txlist);
        ep->ovflist = NULL;
        spin_unlock_irqrestore(&ep->lock, flags);

        /*
         * Now call the callback function.
         */
        error = (*sproc)(ep, &txlist, priv);

        spin_lock_irqsave(&ep->lock, flags);
        /*
         * During the time we spent inside the "sproc" callback, some
         * other events might have been queued by the poll callback.
         * We re-insert them inside the main ready-list here.
         */
        for (nepi = ep->ovflist; (epi = nepi) != NULL;
             nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
            /*
             * We need to check if the item is already in the list.
             * During the "sproc" callback execution time, items are
             * queued into ->ovflist but the "txlist" might already
             * contain them, and the list_splice() below takes care of them.
             */
            if (!ep_is_linked(&epi->rdllink)) {
                list_add_tail(&epi->rdllink, &ep->rdllist);
                ep_pm_stay_awake(epi);
            }
        }
        /*
         * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
         * releasing the lock, events will be queued in the normal way inside
         * ep->rdllist.
         */
        ep->ovflist = EP_UNACTIVE_PTR;

        /*
         * Quickly re-inject items left on "txlist".
         */
        list_splice(&txlist, &ep->rdllist);
        __pm_relax(ep->ws);

        if (!list_empty(&ep->rdllist)) {
            /*
             * Wake up (if active) both the eventpoll wait list and
             * the ->poll() wait list (delayed after we release the lock).
             */
            if (waitqueue_active(&ep->wq))
                wake_up_locked(&ep->wq);
            if (waitqueue_active(&ep->poll_wait))
                pwake++;
        }
        spin_unlock_irqrestore(&ep->lock, flags);

        if (!ep_locked)
            mutex_unlock(&ep->mtx);

        /* We have to call this outside the lock */
        if (pwake)
            ep_poll_safewake(&ep->poll_wait);

        return error;
    }

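ep_scan_ready_list first moves the contents of ep's ready list onto txlist, a list local to the function, and empties the ready list. It also sets ep's ovflist to NULL. ovflist is a singly linked list that serves as a backup list for ready events: while the kernel is copying events to user space, the target files may produce new events, and during that window the new events are linked into ovflist instead.

Next the sproc callback is invoked (here it is ep_send_events_proc) to copy the event data from the kernel to user space.
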
sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list -> ep_send_events_proc:

    static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
                                   void *priv)
    {
        struct ep_send_events_data *esed = priv;
        int eventcnt;
        unsigned int revents;
        struct epitem *epi;
        struct epoll_event __user *uevent;
        struct wakeup_source *ws;
        poll_table pt;

        init_poll_funcptr(&pt, NULL);

        /*
         * We can loop without lock because we are passed a task private list.
         * Items cannot vanish during the loop because ep_scan_ready_list() is
         * holding "mtx" during this call.
         */
        for (eventcnt = 0, uevent = esed->events;
             !list_empty(head) && eventcnt < esed->maxevents;) {
            epi = list_first_entry(head, struct epitem, rdllink);

            /*
             * Activate ep->ws before deactivating epi->ws to prevent
             * triggering auto-suspend here (in case we reactive epi->ws
             * below).
             *
             * This could be rearranged to delay the deactivation of epi->ws
             * instead, but then epi->ws would temporarily be out of sync
             * with ep_is_linked().
             */
            ws = ep_wakeup_source(epi);
            if (ws) {
                if (ws->active)
                    __pm_stay_awake(ep->ws);
                __pm_relax(ws);
            }

            list_del_init(&epi->rdllink);

            revents = ep_item_poll(epi, &pt);

            /*
             * If the event mask intersect the caller-requested one,
             * deliver the event to userspace. Again, ep_scan_ready_list()
             * is holding "mtx", so no operations coming from userspace
             * can change the item.
             */
            if (revents) {
                if (__put_user(revents, &uevent->events) ||
                    __put_user(epi->event.data, &uevent->data)) {
                    list_add(&epi->rdllink, head);
                    ep_pm_stay_awake(epi);
                    return eventcnt ? eventcnt : -EFAULT;
                }
                eventcnt++;
                uevent++;
                if (epi->event.events & EPOLLONESHOT)
                    epi->event.events &= EP_PRIVATE_BITS;
                else if (!(epi->event.events & EPOLLET)) {
                    /*
                     * If this file has been added with Level
                     * Trigger mode, we need to insert back inside
                     * the ready list, so that the next call to
                     * epoll_wait() will check again the events
                     * availability. At this point, no one can insert
                     * into ep->rdllist besides us. The epoll_ctl()
                     * callers are locked out by
                     * ep_scan_ready_list() holding "mtx" and the
                     * poll callback will queue them in ep->ovflist.
                     */
                    list_add_tail(&epi->rdllink, &ep->rdllist);
                    ep_pm_stay_awake(epi);
                }
            }
        }

        return eventcnt;
    }

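The ep_send_events_proc callback loops over the items and fetches their event data. For each item it calls ep_item_poll to get the events pending on the monitored target file, and if there are any, it calls __put_user to copy the data to user space.
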
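The three branches at the end of the loop in ep_send_events_proc (EPOLLONESHOT, EPOLLET, and the default level-triggered case) produce behaviour that is easy to tell apart from user space. The sketch below is illustrative, with a hypothetical helper name: it leaves one unread byte in a pipe and looks at what a second epoll_wait() reports after the first one has delivered the event.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers a pipe that holds one unread byte with EPOLLIN | extra_flags
 * and returns the event count of the second of two back-to-back,
 * non-blocking epoll_wait() calls (-1 on error).
 */
int second_wait_count(uint32_t extra_flags)
{
    struct epoll_event ev = { .events = EPOLLIN | extra_flags }, out;
    int p[2], epfd, n = -1;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create1(0);

    if (epfd >= 0 &&
        write(p[1], "x", 1) == 1 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        epoll_wait(epfd, &out, 1, 0) == 1)   /* first call delivers the event */
        n = epoll_wait(epfd, &out, 1, 0);    /* the byte is still unread */

    if (epfd >= 0)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return n;
}
```

With no extra flag (level-triggered) the item is put back on the ready list, so the second call reports it again. With EPOLLET or EPOLLONESHOT the item is not re-queued, and the second call reports nothing until a new event arrives or, for EPOLLONESHOT, the descriptor is re-armed with EPOLL_CTL_MOD.
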
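Back in ep_scan_ready_list: as said above, while the sproc callback is running, target files may produce new events, which get linked into ovflist. So once the callback has finished, the events on ovflist have to be added back to the ready list rdllist.

Finally, if rdllist is not empty (that is, there are ready events) and tasks are waiting for them, wake_up_locked is called to wake the waiting tasks so that they handle the new events (the flow is the same as before: the events are copied to user space).

That completes the epoll_wait flow, but one question remains. A process that calls epoll_wait goes to sleep, as described above, so when is it woken? When epoll_ctl registers an item for a target file, ep_ptable_queue_proc is installed as the queue callback for that item. ep_ptable_queue_proc adds an entry to the wait queue of the target file and registers ep_poll_callback as its wakeup function. When the target file produces an event, ep_poll_callback runs and wakes the processes sleeping on the epoll wait queue.
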
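The sleep and wakeup sequence can be observed with two processes. In the sketch below (illustrative, with a hypothetical function name) the parent blocks in epoll_wait() on an idle pipe while a child process writes to the pipe a little later; the write is what triggers the callback chain that wakes the parent. A finite timeout is used only so that the example cannot hang.

```c
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Returns the number of events epoll_wait() reports after having been
 * woken by a write from a child process (-1 on error, 0 on timeout).
 */
int woken_by_writer(void)
{
    struct epoll_event ev = { .events = EPOLLIN }, out;
    struct timespec delay = { 0, 50 * 1000 * 1000 };   /* 50 ms */
    int p[2], epfd, n = -1;
    pid_t pid;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        goto out;

    pid = fork();
    if (pid < 0)
        goto out;
    if (pid == 0) {                        /* child: the event source */
        nanosleep(&delay, NULL);
        _exit(write(p[1], "x", 1) == 1 ? 0 : 1);
    }

    /* parent: nothing is ready yet, so this call goes to sleep */
    n = epoll_wait(epfd, &out, 1, 5000);
    waitpid(pid, NULL, 0);
out:
    if (epfd >= 0)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return n;
}
```
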
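To summarize epoll_wait: the function puts the calling process to sleep (except when timeout is 0). When a monitored event occurs, the process is woken, and the events are copied from the kernel to user space and returned to the process.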
